<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: PeterC3.dev</title>
    <description>The latest articles on DEV Community by PeterC3.dev (@peterc3dev).</description>
    <link>https://dev.to/peterc3dev</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3848804%2Fa17fba86-dbf9-4dbd-87bd-21c44822ba6f.png</url>
      <title>DEV Community: PeterC3.dev</title>
      <link>https://dev.to/peterc3dev</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/peterc3dev"/>
    <language>en</language>
    <item>
      <title>What If Your NPU Solved Problems Like Magnets Instead of Math?</title>
      <dc:creator>PeterC3.dev</dc:creator>
      <pubDate>Mon, 30 Mar 2026 09:43:27 +0000</pubDate>
      <link>https://dev.to/peterc3dev/what-if-your-npu-solved-problems-like-magnets-instead-of-math-5dc5</link>
      <guid>https://dev.to/peterc3dev/what-if-your-npu-solved-problems-like-magnets-instead-of-math-5dc5</guid>
      <description>&lt;p&gt;I've been building a &lt;a href="https://github.com/Peterc3-dev/rag-race-router" rel="noopener noreferrer"&gt;three-processor inference runtime&lt;/a&gt; for AMD Ryzen AI 300 APUs — CPU, GPU, and NPU working together. Last night, while thinking about how the NPU's systolic array actually works at the hardware level, something clicked.&lt;/p&gt;

&lt;h2&gt;
  
  
  The observation
&lt;/h2&gt;

&lt;p&gt;Every ML model today does the same thing: billions of multiply-accumulate operations. Multiply this matrix by that matrix. Add bias. Apply activation function. Repeat ten billion times.&lt;/p&gt;

&lt;p&gt;But that's not how I solve problems. I don't compute the answer — I &lt;em&gt;feel&lt;/em&gt; which answer fits. Like grabbing a handful of Magnetix (those magnetic construction toys) and shaking them until the pieces snap into the right configuration. You don't calculate the geometry. The magnetic fields resolve it for you.&lt;/p&gt;

&lt;p&gt;What if compute worked the same way?&lt;/p&gt;

&lt;h2&gt;
  
  
  Hyperdimensional Computing already exists
&lt;/h2&gt;

&lt;p&gt;This isn't science fiction. It's an active research field called &lt;strong&gt;Hyperdimensional Computing (HDC)&lt;/strong&gt; or &lt;strong&gt;Vector Symbolic Architectures (VSA)&lt;/strong&gt;, and it works exactly like the magnet analogy:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data is encoded as &lt;strong&gt;10,000-dimensional vectors&lt;/strong&gt; (hypervectors). Not a single number. Not a small array. A point in a vast geometric space.&lt;/li&gt;
&lt;li&gt;Operations are &lt;strong&gt;XOR&lt;/strong&gt; (bind two concepts together), &lt;strong&gt;majority vote&lt;/strong&gt; (combine multiple vectors), and &lt;strong&gt;cosine similarity&lt;/strong&gt; (which stored pattern is closest to this input?).&lt;/li&gt;
&lt;li&gt;Answers emerge from &lt;strong&gt;geometric resonance&lt;/strong&gt; — the system finds the nearest match in high-dimensional space, like a magnet snapping to its complement.&lt;/li&gt;
&lt;/ul&gt;
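&lt;p&gt;A minimal NumPy sketch of those three primitives, using bipolar (±1) hypervectors, where binding is elementwise multiplication, the ±1 analogue of XOR:&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)
D = 10_000  # hypervector dimensionality

# Random bipolar hypervectors are quasi-orthogonal in high dimensions
a = rng.choice([-1, 1], size=D)   # concept A
b = rng.choice([-1, 1], size=D)   # concept B

def cos(u, v):
    # Cosine similarity for bipolar vectors reduces to a scaled dot product
    return float(u @ v) / D

# Bind: elementwise multiply (the ±1 analogue of XOR). It's invertible:
# binding the result with b again recovers a exactly.
bound = a * b
print(cos(a, b))          # near 0: unrelated concepts
print(cos(bound * b, a))  # exactly 1.0: unbinding recovers a

# Bundle: elementwise majority vote over an odd number of vectors.
# The bundle stays measurably similar to each of its inputs.
c = rng.choice([-1, 1], size=D)
bundle = np.sign(a + b + c)
print(cos(bundle, a))     # clearly positive: a is "inside" the bundle
```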

&lt;p&gt;Key properties that matter:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;10x more error-tolerant than neural networks.&lt;/strong&gt; When your representation is distributed across 10,000 dimensions, noise in any individual dimension cancels out. Corrupt 10% of the vector and the answer is still correct.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No training loop.&lt;/strong&gt; You don't need gradient descent. Codebook entries are &lt;em&gt;constructed&lt;/em&gt; — you encode a known-good state as a hypervector and store it. To classify, you find the nearest stored vector. No backpropagation, no loss functions, no epochs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Incremental learning without forgetting.&lt;/strong&gt; New patterns are added by bundling (majority vote). The system doesn't forget old patterns when it learns new ones. No catastrophic forgetting — the bane of neural networks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Massively parallel.&lt;/strong&gt; All similarity comparisons happen simultaneously. There's no sequential dependency between checking pattern A and pattern B.&lt;/p&gt;
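&lt;p&gt;Those properties combine into a classifier with no training loop. A NumPy sketch with a hypothetical three-entry codebook, showing both the parallel similarity search and the 10%-corruption tolerance:&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(1)
D = 10_000

# Codebook: three stored patterns, constructed directly (no training loop)
codebook = rng.choice([-1, 1], size=(3, D))

def classify(query):
    # Every similarity check at once: one matrix-vector product,
    # then pick the nearest stored pattern. No sequential dependency.
    sims = codebook @ query
    return int(np.argmax(sims))

# Corrupt 10% of pattern 1; the nearest match is still pattern 1
query = codebook[1].copy()
idx = rng.choice(D, size=D // 10, replace=False)
query[idx] = -query[idx]
print(classify(query))  # 1: the distributed representation absorbs the noise
```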

&lt;h2&gt;
  
  
  Why XDNA 2's systolic array is a natural fit
&lt;/h2&gt;

&lt;p&gt;Here's where it gets interesting for AMD hardware.&lt;/p&gt;

&lt;p&gt;The XDNA 2 NPU in Ryzen AI 300 is a &lt;strong&gt;systolic array&lt;/strong&gt; — a grid of identical cells, each performing multiply-accumulate, passing data to neighbors in rhythmic pulses. Two perpendicular data flows cross at each cell. It's designed for matrix multiplication.&lt;/p&gt;

&lt;p&gt;HDC's core operation — "which stored pattern is most similar to this input?" — &lt;em&gt;is&lt;/em&gt; a matrix multiply:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;similarity = input_vector · codebook_entry^T
           = Σ(input[i] × entry[i])  for i in 10,000 dimensions
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's a dot product. The systolic array does thousands of these per cycle. At 50 TOPS, the NPU can evaluate a 10,000-dimensional similarity search against a codebook of 1,000 entries in &lt;strong&gt;microseconds&lt;/strong&gt;. At 2 watts.&lt;/p&gt;
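&lt;p&gt;A rough sanity check on that timing claim, counting one multiply and one add per dimension and assuming peak throughput (real utilization will be lower):&lt;/p&gt;

```python
# Back-of-envelope: full codebook search as one (1000 x 10000) matvec
entries, dims = 1_000, 10_000
ops = entries * dims * 2     # one multiply + one add per element
tops = 50e12                 # XDNA 2 peak, operations per second
print(ops / tops * 1e6)      # microseconds at peak: 0.4 us
```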

&lt;p&gt;The hardware is already there. The software abstraction connecting HDC to the XDNA 2 systolic array doesn't exist yet.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this means for inference
&lt;/h2&gt;

&lt;p&gt;Current flow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Input → billions of multiply-accumulate ops → Output
        (brute-force arithmetic, sequential)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;HDC flow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Input → encode as hypervector → similarity search across learned field → Output snaps into place
         (geometric resolution, parallel, fault-tolerant)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The computation is the same hardware operation (matrix multiply for similarity), but the &lt;strong&gt;representation&lt;/strong&gt; is fundamentally different. Information is distributed holographically — every dimension contains partial information about every concept. Answers emerge from the geometry of the space, not from chaining arithmetic.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where I'm applying this
&lt;/h2&gt;

&lt;p&gt;In &lt;a href="https://github.com/Peterc3-dev/rag-race-router" rel="noopener noreferrer"&gt;R.A.G-Race-Router&lt;/a&gt;, I'm building a runtime that dispatches ML operations across CPU, GPU, and NPU with learned routing. Currently, the dispatcher uses a small neural network trained via policy gradient — standard RL.&lt;/p&gt;

&lt;p&gt;The next step (&lt;a href="https://github.com/Peterc3-dev/rag-race-router/blob/master/ROADMAP.md" rel="noopener noreferrer"&gt;Phase 3 in the roadmap&lt;/a&gt;): replace the arithmetic dispatcher with an HDC routing layer.&lt;/p&gt;

&lt;p&gt;Instead of: &lt;code&gt;if gpu_temp &amp;gt; 75 and op_size &amp;gt; threshold: route_to_npu()&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;This: encode the current system state (GPU temp, workload type, memory pressure, NPU availability) as a hypervector, then find the nearest match in a codebook of known-good routing configurations. The optimal routing &lt;strong&gt;snaps into place&lt;/strong&gt; via field resolution, not branching logic.&lt;/p&gt;
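&lt;p&gt;One common way to build such a state hypervector is role-filler binding: bind a vector naming the slot to a vector naming its value, then bundle the pairs. A sketch of the idea (the slot and value names below are invented for illustration, not the router's actual encoding):&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(2)
D = 10_000

# Roles name the slot ("gpu_temp", "op_kind"); fillers name the value.
roles = {name: rng.choice([-1, 1], size=D) for name in ["gpu_temp", "op_kind", "mem"]}
fillers = {name: rng.choice([-1, 1], size=D)
           for name in ["hot", "cool", "matmul", "embed", "high", "low"]}

def encode(state):
    # Bind each role to its filler, then bundle with a majority vote
    parts = [roles[r] * fillers[v] for r, v in state.items()]
    return np.sign(np.sum(parts, axis=0))

s1 = encode({"gpu_temp": "hot", "op_kind": "matmul", "mem": "high"})
s2 = encode({"gpu_temp": "hot", "op_kind": "matmul", "mem": "low"})
s3 = encode({"gpu_temp": "cool", "op_kind": "embed", "mem": "low"})

# States that share slot values land near each other in the space
print(float(s1 @ s2) / D)  # clearly positive: two of three slots agree
print(float(s1 @ s3) / D)  # near 0: nothing in common
```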

&lt;p&gt;Phase 4 goes further — using HDC as an inference primitive itself. Not just for routing decisions, but for the actual model inference. The answer self-assembles from field dynamics.&lt;/p&gt;

&lt;h2&gt;
  
  
  Can you train this?
&lt;/h2&gt;

&lt;p&gt;Yes, but differently than neural networks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Constructive learning:&lt;/strong&gt; You don't train HDC with gradient descent. You construct codebook entries from observed data. See a good routing configuration? Encode it as a hypervector. Store it. Next time a similar state occurs, the system snaps to it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reinforcement from experience:&lt;/strong&gt; Run inference, measure the outcome (latency, quality, power draw), encode the result. Good outcomes strengthen their codebook entries (the hypervector gets bundled with more examples). Bad outcomes weaken theirs. The codebook evolves over time — not through backprop, but through accumulation.&lt;/p&gt;
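&lt;p&gt;A minimal sketch of that accumulation scheme, assuming one integer accumulator per routing choice (the state encoding is simplified here to a single random hypervector):&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(3)
D = 10_000

# One integer accumulator per routing choice; the working prototype is
# its elementwise sign. Bundling new evidence is just addition, so old
# patterns are diluted gradually, never overwritten.
acc = {"npu": np.zeros(D, dtype=int), "gpu": np.zeros(D, dtype=int)}

def reinforce(choice, state_hv, good):
    # Good outcomes strengthen the entry, bad outcomes weaken it
    acc[choice] += state_hv if good else -state_hv

def prototype(choice):
    return np.sign(acc[choice])

# Simulated experience: this state worked well on the NPU, badly on GPU
state = rng.choice([-1, 1], size=D)
reinforce("npu", state, good=True)
reinforce("npu", state, good=True)
reinforce("gpu", state, good=False)

sims = {c: float(prototype(c) @ state) / D for c in acc}
print(max(sims, key=sims.get))  # npu
```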

&lt;p&gt;&lt;strong&gt;Transfer learning is natural:&lt;/strong&gt; Because HDC encodes &lt;em&gt;structure&lt;/em&gt;, not &lt;em&gt;weights&lt;/em&gt;, a codebook learned for one model can transfer to another. The routing patterns for "large matmul on hot GPU" are similar regardless of which model the matmul belongs to.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://github.com/hyperdimensional-computing/torchhd" rel="noopener noreferrer"&gt;torchhd library&lt;/a&gt; (&lt;code&gt;pip install torch-hd&lt;/code&gt;) implements all of this in PyTorch. The research bridge between HDC and the XDNA 2 systolic array is what I'm building.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'm not claiming
&lt;/h2&gt;

&lt;p&gt;This is a research direction, not a shipped product. HDC has shown strong results for classification tasks. Whether it works for generation (audio, text, images) is an open question. The quality bar for "sounds good" is much higher than "classifies correctly."&lt;/p&gt;

&lt;p&gt;But the hardware match is real. The XDNA 2 systolic array already does the math that HDC needs. Nobody has connected these two pieces on consumer APU hardware. That's the research spike.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try it yourself
&lt;/h2&gt;

&lt;p&gt;If you're on Ryzen AI 300 hardware:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;torch-hd
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torchhd&lt;/span&gt;

&lt;span class="c1"&gt;# Encode two concepts as 10,000-dimensional hypervectors
&lt;/span&gt;&lt;span class="n"&gt;A&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torchhd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;random&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# "GPU is hot"
&lt;/span&gt;&lt;span class="n"&gt;B&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torchhd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;random&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# "workload is large"
&lt;/span&gt;
&lt;span class="c1"&gt;# Bind them together (XOR-like)
&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torchhd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;bind&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;B&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# "GPU is hot AND workload is large"
&lt;/span&gt;
&lt;span class="c1"&gt;# Compare against stored patterns
&lt;/span&gt;&lt;span class="n"&gt;pattern_npu&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torchhd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;random&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;   &lt;span class="c1"&gt;# Known-good: route to NPU
&lt;/span&gt;&lt;span class="n"&gt;pattern_gpu&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torchhd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;random&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;   &lt;span class="c1"&gt;# Known-good: route to GPU
&lt;/span&gt;
&lt;span class="c1"&gt;# Which pattern does the current state resemble most?
&lt;/span&gt;&lt;span class="n"&gt;sim_npu&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torchhd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cosine_similarity&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pattern_npu&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;sim_gpu&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torchhd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cosine_similarity&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pattern_gpu&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# → The higher similarity wins. Field resolution, not if/else.
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The full roadmap, architecture, and Phase 1-2 implementation (working three-processor dispatch with learned routing) are at:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/Peterc3-dev/rag-race-router" rel="noopener noreferrer"&gt;github.com/Peterc3-dev/rag-race-router&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Previous posts in this series:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://dev.to/peterc3dev/your-amd-apu-has-three-processors-why-does-ml-only-use-one-4ibi"&gt;Your AMD APU Has Three Processors. Why Does ML Only Use One?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/peterc3dev/i-got-all-three-processors-talking-to-each-other-on-my-amd-laptop-29k3"&gt;I Got All Three Processors Talking to Each Other&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;I'm Peter Clemente (&lt;a href="https://github.com/Peterc3-dev" rel="noopener noreferrer"&gt;@Peterc3-dev&lt;/a&gt;). I build inference systems on AMD APUs running Linux. This project is part of CIN — a distributed inference network that treats every device as a node.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>amd</category>
      <category>machinelearning</category>
      <category>research</category>
      <category>opensource</category>
    </item>
    <item>
      <title>I Got All Three Processors Talking to Each Other on My AMD Laptop</title>
      <dc:creator>PeterC3.dev</dc:creator>
      <pubDate>Sun, 29 Mar 2026 14:31:40 +0000</pubDate>
      <link>https://dev.to/peterc3dev/i-got-all-three-processors-talking-to-each-other-on-my-amd-laptop-29k3</link>
      <guid>https://dev.to/peterc3dev/i-got-all-three-processors-talking-to-each-other-on-my-amd-laptop-29k3</guid>
      <description>&lt;p&gt;Last week I published &lt;a href="https://dev.to/peterc3/the-scheduler-that-learns-your-chip-building-tri-processor-inference-for-amd-ryzen-ai-300-5gci"&gt;an architecture&lt;/a&gt;. This week it runs.&lt;/p&gt;

&lt;p&gt;The R.A.G-Race-Router engine now dispatches real workloads across all three processors on my GPD Pocket 4 (AMD Ryzen AI 9 HX 370): CPU (Zen 5), iGPU (Radeon 890M via Vulkan), and NPU (XDNA 2 at 50 TOPS). Here's what actually happened when I wired it up.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Three-Processor Demo
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Processors: CPU ok  GPU (Vulkan) ok  NPU (XDNA) ok
GPU Temp: 43C | VRAM: 1732/8192 MB | Pulse: READY

  tokenize  -&amp;gt; CPU  (7ms)     [lightweight, no dispatch overhead]
  embed     -&amp;gt; NPU  (282ms)   [embedding lookup, NPU efficient]
  matmul    -&amp;gt; GPU  (5ms)     [512x512 via Vulkan SPIR-V shader]
  attention -&amp;gt; GPU  (5ms)     [scaled dot-product, pulsed burst]
  normalize -&amp;gt; NPU  (5ms)     [RMS norm, NPU sweet spot]
  project   -&amp;gt; GPU  (6ms)     [linear projection, pulsed burst]
  decode    -&amp;gt; CPU  (6ms)     [greedy argmax, trivial]

Pipeline: 316ms total (16ms GPU, 287ms NPU, 13ms CPU)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each operation is dispatched to the device the engine thinks is best, based on heuristics during the first few runs and learned routing rules after that. The personality database records every execution and gradually builds a profile of this specific chip.&lt;/p&gt;
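&lt;p&gt;Conceptually, the personality database reduces to "record every (op, device, latency) observation, then route each op to its best-known device." A minimal sketch of that idea, not the project's actual schema:&lt;/p&gt;

```python
from collections import defaultdict
from statistics import mean

# Hypothetical personality store: latency samples per (op, device) pair
history = defaultdict(list)

def record(op, device, ms):
    history[(op, device)].append(ms)

def best_device(op, devices=("cpu", "gpu", "npu")):
    # Route to the device with the best observed average latency;
    # fall back to a heuristic default before any measurements exist.
    seen = {d: history.get((op, d), []) for d in devices}
    seen = {d: xs for d, xs in seen.items() if xs}
    if not seen:
        return "cpu"
    return min(seen, key=lambda d: mean(seen[d]))

record("matmul", "cpu", 7.6)
record("matmul", "gpu", 5.0)
record("embed", "npu", 199.0)
print(best_device("matmul"))  # gpu
```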

&lt;h2&gt;
  
  
  The NPU Was Broken — We Fixed It
&lt;/h2&gt;

&lt;p&gt;The NPU on Strix Point refused to initialize. The kernel driver's SMU (System Management Unit) failed with &lt;code&gt;smu cmd 4 failed, 0xff&lt;/code&gt; on every boot. Three sessions of debugging later:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Root cause&lt;/strong&gt;: The driver calls SMU init &lt;em&gt;before&lt;/em&gt; loading firmware via PSP (Platform Security Processor). On Strix Point, the SMU doesn't respond until firmware is loaded. Classic init-order bug.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The fix&lt;/strong&gt;: A three-line patch to the out-of-tree amdxdna driver that skips SMU when it fails, loads firmware via PSP anyway, and continues without power management. The NPU runs at default BIOS clocks.&lt;/p&gt;

&lt;p&gt;Result: Llama 3.2 1B at 40-46 tok/s prefill, 14-24 tok/s decode, running on the NPU via FastFlowLM. The patched driver loads automatically on boot via a systemd service.&lt;/p&gt;

&lt;h2&gt;
  
  
  Vulkan Is Still Faster Than ROCm on This GPU
&lt;/h2&gt;

&lt;p&gt;Updated benchmarks with the engine's Kompute integration:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Backend&lt;/th&gt;
&lt;th&gt;Workload&lt;/th&gt;
&lt;th&gt;Performance&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;CPU (NumPy/BLAS)&lt;/td&gt;
&lt;td&gt;512x512 matmul&lt;/td&gt;
&lt;td&gt;7.6ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPU (Vulkan/Kompute)&lt;/td&gt;
&lt;td&gt;512x512 matmul&lt;/td&gt;
&lt;td&gt;5.0ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPU (Vulkan/IREE)&lt;/td&gt;
&lt;td&gt;1024x1024 matmul&lt;/td&gt;
&lt;td&gt;1,085 GFLOPS&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;NPU (FLM)&lt;/td&gt;
&lt;td&gt;Llama 3.2 1B prefill&lt;/td&gt;
&lt;td&gt;40-46 tok/s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The Vulkan path uses pre-compiled SPIR-V shaders (matmul, attention, fused add-scale) dispatched through Kompute 0.9.0. ROCm's &lt;code&gt;hipMallocManaged&lt;/code&gt; remains broken on gfx1150 — Vulkan accesses the full VRAM+GTT pool while HIP only sees the BIOS carveout.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Engine Learns Your Chip
&lt;/h2&gt;

&lt;p&gt;After 5 runs, the personality database encodes routing rules:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Hardware Personality (35 runs):
Operation        Best on  Avg (ms)   Confidence
tokenize         cpu      0.04       100%
embed            npu      199.04     100%
matmul           gpu      0.50       (learning)
attention        gpu      0.53       (learning)
decode           cpu      0.06       100%
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The system learns that embedding is best on NPU, tokenization belongs on CPU, and matrix ops go to GPU. When GPU temperature spikes, the dispatcher reroutes small ops to NPU or CPU. Every reroute is logged and fed back into the personality.&lt;/p&gt;

&lt;h2&gt;
  
  
  Thermal Stress Test
&lt;/h2&gt;

&lt;p&gt;I pushed the GPU for 30 seconds of continuous compute to test adaptive rerouting:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;438 operations dispatched&lt;/li&gt;
&lt;li&gt;54 reroutes (matmul -&amp;gt; CPU when GPU was busy)&lt;/li&gt;
&lt;li&gt;GPU temperature: 45C -&amp;gt; 50C (well within thermal budget)&lt;/li&gt;
&lt;li&gt;Distribution: 380 GPU, 53 NPU, 5 CPU&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The pulsed execution model (burst on GPU, check temperature, cooldown if needed) prevents thermal throttling. The engine's pulse controller adapts the burst/cooldown ratio based on real-time temperature readings from amdgpu_top.&lt;/p&gt;
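&lt;p&gt;The burst/check/cooldown loop can be sketched in a few lines. The threshold and the stand-in temperature reader below are invented for illustration; the real engine polls amdgpu_top:&lt;/p&gt;

```python
import random
import time

# Sketch of the pulsed execution model: burst on the GPU, check the
# temperature, back off when it runs warm. Threshold is illustrative.
TEMP_LIMIT_C = 70.0

def read_gpu_temp():
    # Stand-in for polling amdgpu_top / hwmon sensors
    return random.uniform(40.0, 80.0)

def pulsed_run(burst_fn, bursts=5, cooldown_s=0.01):
    done = 0
    for _ in range(bursts):
        if read_gpu_temp() > TEMP_LIMIT_C:
            time.sleep(cooldown_s)   # cooldown instead of throttling
            continue
        burst_fn()                   # one burst of GPU work
        done += 1
    return done

completed = pulsed_run(lambda: None)
print(completed)  # number of bursts that actually executed
```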

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;This is still pre-alpha. The dispatch overhead matters — for tiny operations, routing through the engine is slower than just running on CPU. The win is thermal management and sustained throughput for long-running workloads.&lt;/p&gt;

&lt;p&gt;Next steps:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Route MusicGen audio generation through the engine (text encoder on CPU, decoder on pulsed GPU, EnCodec on CPU)&lt;/li&gt;
&lt;li&gt;Reduce dispatch overhead for small ops (batch scheduling)&lt;/li&gt;
&lt;li&gt;IREE integration for compiled NPU kernels&lt;/li&gt;
&lt;li&gt;Upstream the amdxdna SMU bypass patch&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The code is at &lt;a href="https://github.com/Peterc3-dev/rag-race-router" rel="noopener noreferrer"&gt;Peterc3-dev/rag-race-router&lt;/a&gt;. MIT license.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This project is part of CIN (Collaborative Intelligence Network), a distributed inference system spanning a ThinkCentre M70q hub and this GPD Pocket 4 mobile workstation, connected via Tailscale mesh.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>amd</category>
      <category>machinelearning</category>
      <category>linux</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Your AMD APU Has Three Processors. Why Does ML Only Use One?</title>
      <dc:creator>PeterC3.dev</dc:creator>
      <pubDate>Sun, 29 Mar 2026 06:57:13 +0000</pubDate>
      <link>https://dev.to/peterc3dev/your-amd-apu-has-three-processors-why-does-ml-only-use-one-4ibi</link>
      <guid>https://dev.to/peterc3dev/your-amd-apu-has-three-processors-why-does-ml-only-use-one-4ibi</guid>
      <description>&lt;p&gt;I've been staring at my AMD Ryzen AI HX 370 for months thinking the same thing: this chip has three processors that share a memory bus, and every ML runtime ignores two of them.&lt;/p&gt;

&lt;p&gt;The CPU runs inference. The GPU sits there unless you explicitly set it up. The NPU — 50 TOPS of dedicated neural compute at 2 watts — does literally nothing unless you're on Windows blurring your webcam background.&lt;/p&gt;

&lt;p&gt;What if a runtime used all three? And what if it &lt;em&gt;learned&lt;/em&gt; the optimal split for your specific chip?&lt;/p&gt;

&lt;h2&gt;
  
  
  The hardware nobody's exploiting
&lt;/h2&gt;

&lt;p&gt;The Ryzen AI 300 series is a monolithic die. CPU (Zen 5), iGPU (RDNA 3.5, 16 CUs), and NPU (XDNA 2) share physical LPDDR5X through one memory controller. No dedicated VRAM. True unified memory architecture.&lt;/p&gt;

&lt;p&gt;The theoretical pipeline:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;NPU&lt;/strong&gt; handles efficient inference at 50 TOPS / 2W — your always-on workhorse&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;iGPU&lt;/strong&gt; handles flexible parallel compute — batch processing, larger models&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CPU&lt;/strong&gt; orchestrates, preprocesses, and fills gaps&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;scheduler&lt;/strong&gt; (running on the NPU itself) learns which ops run best where on &lt;em&gt;your&lt;/em&gt; chip&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;After 47 generations, it tells you: &lt;em&gt;"guidance_scale=4.2 produces the cleanest output on this hardware."&lt;/em&gt; After a driver update: &lt;em&gt;"ROCm 7.3 improved GPU throughput 15% — redistributing layers."&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  I did the research. Here's what's actually possible today.
&lt;/h2&gt;

&lt;p&gt;I spent a week mapping every dependency, every driver, every research paper. The short version:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What works (March 2026):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;NPU inference via FastFlowLM: Llama 3.2 1B at ~60 tok/s, under 2 watts&lt;/li&gt;
&lt;li&gt;XDNA kernel driver mainlined in Linux 6.14&lt;/li&gt;
&lt;li&gt;iGPU inference via Vulkan llama.cpp (and it's 60% &lt;em&gt;faster&lt;/em&gt; than ROCm — more on that below)&lt;/li&gt;
&lt;li&gt;All three processors sharing physical memory&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What doesn't:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ONNX Runtime's Vitis AI EP is completely broken on Linux&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;hipMallocManaged&lt;/code&gt; returns "not supported" on the 890M&lt;/li&gt;
&lt;li&gt;No DMA-BUF bridge between the GPU and NPU drivers&lt;/li&gt;
&lt;li&gt;Nobody has run all three processors simultaneously for inference&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Vulkan surprise
&lt;/h2&gt;

&lt;p&gt;Here's something the ROCm community hasn't fully absorbed: &lt;strong&gt;Vulkan outperforms ROCm by ~60% for prompt processing on the Radeon 890M.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The reason is memory access. ROCm's &lt;code&gt;hipMalloc&lt;/code&gt; can only address the BIOS-configured VRAM carveout. On a 96GB system, that might be 48GB. Vulkan sees the entire pool — VRAM plus GTT, 80+ GB — via &lt;code&gt;VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;On a memory-bandwidth-bound workload over a 120 GB/s LPDDR5X bus, that gap is decisive. For this project, Vulkan is the iGPU backend.&lt;/p&gt;
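&lt;p&gt;A back-of-envelope number makes the bandwidth bound concrete: during decode, each generated token streams roughly the whole weight file through the bus, so bandwidth divided by model size caps tokens per second:&lt;/p&gt;

```python
# Rough decode-speed ceiling for a bandwidth-bound LLM
bandwidth_gb_s = 120              # LPDDR5X bus on this APU
model_gb = 4.0                    # e.g. an 8B model at roughly 4-bit quantization
print(bandwidth_gb_s / model_gb)  # upper bound in tokens per second: 30.0
```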

&lt;h2&gt;
  
  
  The NPU-as-scheduler concept
&lt;/h2&gt;

&lt;p&gt;This is the part that has no prior art in the literature. I checked.&lt;/p&gt;

&lt;p&gt;The idea: dedicate a few NPU columns to running a tiny scheduling neural network (~500 bytes, based on CoDL's latency predictor architecture) that monitors CPU and GPU utilization, thermal state, and inference latency — then dynamically redistributes operators across all three processors.&lt;/p&gt;

&lt;p&gt;The NPU is perfect for this because:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It runs at &amp;lt;2W, always-on without thermal impact&lt;/li&gt;
&lt;li&gt;XDNA 2 supports dynamic spatial partitioning at column boundaries&lt;/li&gt;
&lt;li&gt;The remaining NPU columns still handle inference workloads&lt;/li&gt;
&lt;li&gt;It's literally a neural processor running a neural scheduling policy&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The closest analogy is SmartNIC-as-orchestrator in distributed systems (Wave, Conspirator, RingLeader) — an auxiliary processor dedicated to scheduling decisions. Nobody's applied this pattern to NPUs.&lt;/p&gt;

&lt;h2&gt;
  
  
  First of its kind
&lt;/h2&gt;

&lt;p&gt;I reviewed 25+ papers, AMD's own 2025 scheduling research, and every major heterogeneous inference project I could find. Nobody has built this.&lt;/p&gt;

&lt;p&gt;CPU+GPU co-execution is well-studied (CoDL, SparOA). NPU+GPU scheduling exists in limited forms. But a self-optimizing runtime that dynamically partitions operators across all three processors on a consumer APU? No prior implementation. AMD's Karami et al. (2025) characterize the &lt;em&gt;problem&lt;/em&gt; — a combinatorial scheduling search space of O(2^125) — but don't build a runtime that solves it.&lt;/p&gt;

&lt;p&gt;To be clear: I've architected this and validated feasibility, not shipped a running system. But the architecture itself, the feasibility study, and the research synthesis represent the first open-source attempt at this category of runtime.&lt;/p&gt;

&lt;h2&gt;
  
  
  Six novel contributions
&lt;/h2&gt;

&lt;p&gt;From the full literature review, these have no existing implementation:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;NPU-as-scheduling-agent&lt;/strong&gt; for CPU+GPU workload orchestration&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Persistent hardware personality&lt;/strong&gt; — an evolving model of your chip's specific behavior over weeks/months&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Three-processor dynamic operator placement&lt;/strong&gt; on a single SoC (CPU+GPU is studied; all three is not)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cross-model transfer learning&lt;/strong&gt; for on-device scheduling (learning from Model A improves scheduling of Model B)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vulkan+XRT memory bridge&lt;/strong&gt; — combining Vulkan's superior unified memory access with XRT buffer objects via CPU-mediated sharing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;NPU-bookended assembly line&lt;/strong&gt; — NPU dispatches at the start, assembles at the end; CPU and GPU are decoupled async producers. 1000:1 speed ratio makes scheduling overhead effectively zero&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  What's next
&lt;/h2&gt;

&lt;p&gt;I'm calling the project &lt;strong&gt;R.A.G-Race-Router&lt;/strong&gt; (an Adaptive Tri-Processor Inference Runtime). The runtime treats the three processors as an assembly line: CPU and GPU are asynchronous production belts, and the NPU bookends the pipeline — dispatching work at the start, assembling output at the end. At 50 TOPS, the NPU evaluates scheduling decisions in microseconds while CPU/GPU compute takes milliseconds, so from the pipeline's point of view it is effectively in two places at once. After a few runs, it encodes the dispatch pattern as lightweight rules that auto-execute with near-zero overhead, only re-engaging when something changes.&lt;/p&gt;

&lt;p&gt;The full feasibility study, architecture, literature review, and Phase 1 build instructions are here:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;→ &lt;a href="https://github.com/Peterc3-dev/rag-race-router" rel="noopener noreferrer"&gt;github.com/Peterc3-dev/rag-race-router&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Phase 1 is proving three-processor data flow on a Ryzen AI 300 under CachyOS. The immediate step is getting FastFlowLM running on the NPU and benchmarking the three-way pipeline.&lt;/p&gt;

&lt;p&gt;This is pre-alpha. No code yet — just architecture, validated feasibility, and a clear build path. If you're running Ryzen AI 300 on Linux and this resonates, I'd love to hear from you.&lt;/p&gt;

&lt;h2&gt;
  
  
  The bigger picture
&lt;/h2&gt;

&lt;p&gt;AMD shipped hardware that could redefine edge inference. The silicon is there. The drivers are (mostly) there. What's missing is a runtime that treats the whole SoC as a unified inference machine instead of three separate devices that happen to share a bus.&lt;/p&gt;

&lt;p&gt;Every chip is slightly different. Thermal characteristics, silicon lottery, memory controller behavior, driver versions. A runtime that learns &lt;em&gt;your&lt;/em&gt; chip's personality isn't an optimization — it's a new category.&lt;/p&gt;

&lt;p&gt;The models are coming. The question is whether any runtime will know how to actually use the hardware it's running on.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;I'm Peter Clemente (&lt;a href="https://github.com/Peterc3-dev" rel="noopener noreferrer"&gt;@Peterc3-dev&lt;/a&gt;). I build systems on Linux. This project is part of a broader architecture called CIN — a distributed inference network that treats every device as a node.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>amd</category>
      <category>machinelearning</category>
      <category>linux</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
