<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: PeterC3.dev</title>
    <description>The latest articles on DEV Community by PeterC3.dev (@peterc3dev).</description>
    <link>https://dev.to/peterc3dev</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3848804%2Fa17fba86-dbf9-4dbd-87bd-21c44822ba6f.png</url>
      <title>DEV Community: PeterC3.dev</title>
      <link>https://dev.to/peterc3dev</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/peterc3dev"/>
    <language>en</language>
    <item>
      <title>What If Your NPU Solved Problems Like Magnets Instead of Math?</title>
      <dc:creator>PeterC3.dev</dc:creator>
      <pubDate>Mon, 30 Mar 2026 09:43:27 +0000</pubDate>
      <link>https://dev.to/peterc3dev/what-if-your-npu-solved-problems-like-magnets-instead-of-math-5dc5</link>
      <guid>https://dev.to/peterc3dev/what-if-your-npu-solved-problems-like-magnets-instead-of-math-5dc5</guid>
      <description>&lt;p&gt;I've been building a &lt;a href="https://github.com/Peterc3-dev/rag-race-router" rel="noopener noreferrer"&gt;three-processor inference runtime&lt;/a&gt; for AMD Ryzen AI 300 APUs — CPU, GPU, and NPU working together. Last night, while thinking about how the NPU's systolic array actually works at the hardware level, something clicked.&lt;/p&gt;

&lt;h2&gt;
  
  
  The observation
&lt;/h2&gt;

&lt;p&gt;Every ML model today does the same thing: billions of multiply-accumulate operations. Multiply this matrix by that matrix. Add bias. Apply activation function. Repeat ten billion times.&lt;/p&gt;

&lt;p&gt;But that's not how I solve problems. I don't compute the answer — I &lt;em&gt;feel&lt;/em&gt; which answer fits. Like grabbing a handful of Magnetix (those magnetic construction toys) and shaking them until the pieces snap into the right configuration. You don't calculate the geometry. The magnetic fields resolve it for you.&lt;/p&gt;

&lt;p&gt;What if compute worked the same way?&lt;/p&gt;

&lt;h2&gt;
  
  
  Hyperdimensional Computing already exists
&lt;/h2&gt;

&lt;p&gt;This isn't science fiction. It's an active research field called &lt;strong&gt;Hyperdimensional Computing (HDC)&lt;/strong&gt; or &lt;strong&gt;Vector Symbolic Architectures (VSA)&lt;/strong&gt;, and it works exactly like the magnet analogy:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Data is encoded as &lt;strong&gt;10,000-dimensional vectors&lt;/strong&gt; (hypervectors). Not a single number. Not a small array. A point in a vast geometric space.&lt;/li&gt;
&lt;li&gt;Operations are &lt;strong&gt;XOR&lt;/strong&gt; (bind two concepts together), &lt;strong&gt;majority vote&lt;/strong&gt; (combine multiple vectors), and &lt;strong&gt;cosine similarity&lt;/strong&gt; (which stored pattern is closest to this input?).&lt;/li&gt;
&lt;li&gt;Answers emerge from &lt;strong&gt;geometric resonance&lt;/strong&gt; — the system finds the nearest match in high-dimensional space, like a magnet snapping to its complement.&lt;/li&gt;
&lt;/ul&gt;
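&lt;p&gt;A minimal NumPy sketch of those three primitives, using bipolar (±1) hypervectors, where binding is elementwise multiplication, the ±1 analogue of XOR:&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(0)
D = 10_000  # hypervector dimensionality

# Random bipolar hypervectors are quasi-orthogonal in high dimensions
a = rng.choice([-1, 1], size=D)   # concept A
b = rng.choice([-1, 1], size=D)   # concept B

def cos(u, v):
    # Cosine similarity for bipolar vectors reduces to a scaled dot product
    return float(u @ v) / D

# Bind: elementwise multiply (the ±1 analogue of XOR). It's invertible:
# binding the result with b again recovers a exactly.
bound = a * b
print(cos(a, b))          # near 0: unrelated concepts
print(cos(bound * b, a))  # exactly 1.0: unbinding recovers a

# Bundle: elementwise majority vote over an odd number of vectors.
# The bundle stays measurably similar to each of its inputs.
c = rng.choice([-1, 1], size=D)
bundle = np.sign(a + b + c)
print(cos(bundle, a))     # clearly positive: a is "inside" the bundle
```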

&lt;p&gt;Key properties that matter:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;10x more error-tolerant than neural networks.&lt;/strong&gt; When your representation is distributed across 10,000 dimensions, noise in any individual dimension cancels out. Corrupt 10% of the vector and the answer is still correct.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;No training loop.&lt;/strong&gt; You don't need gradient descent. Codebook entries are &lt;em&gt;constructed&lt;/em&gt; — you encode a known-good state as a hypervector and store it. To classify, you find the nearest stored vector. No backpropagation, no loss functions, no epochs.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Incremental learning without forgetting.&lt;/strong&gt; New patterns are added by bundling (majority vote). The system doesn't forget old patterns when it learns new ones. No catastrophic forgetting — the bane of neural networks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Massively parallel.&lt;/strong&gt; All similarity comparisons happen simultaneously. There's no sequential dependency between checking pattern A and pattern B.&lt;/p&gt;
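&lt;p&gt;Those properties combine into a classifier with no training loop. A NumPy sketch with a hypothetical three-entry codebook, showing both the parallel similarity search and the 10%-corruption tolerance:&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(1)
D = 10_000

# Codebook: three stored patterns, constructed directly (no training loop)
codebook = rng.choice([-1, 1], size=(3, D))

def classify(query):
    # Every similarity check at once: one matrix-vector product,
    # then pick the nearest stored pattern. No sequential dependency.
    sims = codebook @ query
    return int(np.argmax(sims))

# Corrupt 10% of pattern 1; the nearest match is still pattern 1
query = codebook[1].copy()
idx = rng.choice(D, size=D // 10, replace=False)
query[idx] = -query[idx]
print(classify(query))  # 1: the distributed representation absorbs the noise
```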

&lt;h2&gt;
  
  
  Why XDNA 2's systolic array is a natural fit
&lt;/h2&gt;

&lt;p&gt;Here's where it gets interesting for AMD hardware.&lt;/p&gt;

&lt;p&gt;The XDNA 2 NPU in Ryzen AI 300 is a &lt;strong&gt;systolic array&lt;/strong&gt; — a grid of identical cells, each performing multiply-accumulate, passing data to neighbors in rhythmic pulses. Two perpendicular data flows cross at each cell. It's designed for matrix multiplication.&lt;/p&gt;

&lt;p&gt;HDC's core operation — "which stored pattern is most similar to this input?" — &lt;em&gt;is&lt;/em&gt; a matrix multiply:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;similarity = input_vector · codebook_entry^T
           = Σ(input[i] × entry[i])  for i in 10,000 dimensions
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;That's a dot product. The systolic array does thousands of these per cycle. At 50 TOPS, the NPU can evaluate a 10,000-dimensional similarity search against a codebook of 1,000 entries in &lt;strong&gt;microseconds&lt;/strong&gt;. At 2 watts.&lt;/p&gt;
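&lt;p&gt;A rough sanity check on that timing claim, counting one multiply and one add per dimension and assuming peak throughput (real utilization will be lower):&lt;/p&gt;

```python
# Back-of-envelope: full codebook search as one (1000 x 10000) matvec
entries, dims = 1_000, 10_000
ops = entries * dims * 2     # one multiply + one add per element
tops = 50e12                 # XDNA 2 peak, operations per second
print(ops / tops * 1e6)      # microseconds at peak: 0.4 us
```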

&lt;p&gt;The hardware is already there. The software abstraction connecting HDC to the XDNA 2 systolic array doesn't exist yet.&lt;/p&gt;

&lt;h2&gt;
  
  
  What this means for inference
&lt;/h2&gt;

&lt;p&gt;Current flow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Input → billions of multiply-accumulate ops → Output
        (brute-force arithmetic, sequential)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;HDC flow:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Input → encode as hypervector → similarity search across learned field → Output snaps into place
         (geometric resolution, parallel, fault-tolerant)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The computation is the same hardware operation (matrix multiply for similarity), but the &lt;strong&gt;representation&lt;/strong&gt; is fundamentally different. Information is distributed holographically — every dimension contains partial information about every concept. Answers emerge from the geometry of the space, not from chaining arithmetic.&lt;/p&gt;

&lt;h2&gt;
  
  
  Where I'm applying this
&lt;/h2&gt;

&lt;p&gt;In &lt;a href="https://github.com/Peterc3-dev/rag-race-router" rel="noopener noreferrer"&gt;R.A.G-Race-Router&lt;/a&gt;, I'm building a runtime that dispatches ML operations across CPU, GPU, and NPU with learned routing. Currently, the dispatcher uses a small neural network trained via policy gradient — standard RL.&lt;/p&gt;

&lt;p&gt;The next step (&lt;a href="https://github.com/Peterc3-dev/rag-race-router/blob/master/ROADMAP.md" rel="noopener noreferrer"&gt;Phase 3 in the roadmap&lt;/a&gt;): replace the arithmetic dispatcher with an HDC routing layer.&lt;/p&gt;

&lt;p&gt;Instead of: &lt;code&gt;if gpu_temp &amp;gt; 75 and op_size &amp;gt; threshold: route_to_npu()&lt;/code&gt;&lt;/p&gt;

&lt;p&gt;This: encode the current system state (GPU temp, workload type, memory pressure, NPU availability) as a hypervector, then find the nearest match in a codebook of known-good routing configurations. The optimal routing &lt;strong&gt;snaps into place&lt;/strong&gt; via field resolution, not branching logic.&lt;/p&gt;
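&lt;p&gt;One common way to build such a state hypervector is role-filler binding: bind a vector naming the slot to a vector naming its value, then bundle the pairs. A sketch of the idea (the slot and value names below are invented for illustration, not the router's actual encoding):&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(2)
D = 10_000

# Roles name the slot ("gpu_temp", "op_kind"); fillers name the value.
roles = {name: rng.choice([-1, 1], size=D) for name in ["gpu_temp", "op_kind", "mem"]}
fillers = {name: rng.choice([-1, 1], size=D)
           for name in ["hot", "cool", "matmul", "embed", "high", "low"]}

def encode(state):
    # Bind each role to its filler, then bundle with a majority vote
    parts = [roles[r] * fillers[v] for r, v in state.items()]
    return np.sign(np.sum(parts, axis=0))

s1 = encode({"gpu_temp": "hot", "op_kind": "matmul", "mem": "high"})
s2 = encode({"gpu_temp": "hot", "op_kind": "matmul", "mem": "low"})
s3 = encode({"gpu_temp": "cool", "op_kind": "embed", "mem": "low"})

# States that share slot values land near each other in the space
print(float(s1 @ s2) / D)  # clearly positive: two of three slots agree
print(float(s1 @ s3) / D)  # near 0: nothing in common
```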

&lt;p&gt;Phase 4 goes further — using HDC as an inference primitive itself. Not just for routing decisions, but for the actual model inference. The answer self-assembles from field dynamics.&lt;/p&gt;

&lt;h2&gt;
  
  
  Can you train this?
&lt;/h2&gt;

&lt;p&gt;Yes, but differently than neural networks.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Constructive learning:&lt;/strong&gt; You don't train HDC with gradient descent. You construct codebook entries from observed data. See a good routing configuration? Encode it as a hypervector. Store it. Next time a similar state occurs, the system snaps to it.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Reinforcement from experience:&lt;/strong&gt; Run inference, measure the outcome (latency, quality, power draw), encode the result. Good outcomes strengthen their codebook entries (the hypervector gets bundled with more examples). Bad outcomes weaken theirs. The codebook evolves over time — not through backprop, but through accumulation.&lt;/p&gt;
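&lt;p&gt;A minimal sketch of that accumulation scheme, assuming one integer accumulator per routing choice (the state encoding is simplified here to a single random hypervector):&lt;/p&gt;

```python
import numpy as np

rng = np.random.default_rng(3)
D = 10_000

# One integer accumulator per routing choice; the working prototype is
# its elementwise sign. Bundling new evidence is just addition, so old
# patterns are diluted gradually, never overwritten.
acc = {"npu": np.zeros(D, dtype=int), "gpu": np.zeros(D, dtype=int)}

def reinforce(choice, state_hv, good):
    # Good outcomes strengthen the entry, bad outcomes weaken it
    acc[choice] += state_hv if good else -state_hv

def prototype(choice):
    return np.sign(acc[choice])

# Simulated experience: this state worked well on the NPU, badly on GPU
state = rng.choice([-1, 1], size=D)
reinforce("npu", state, good=True)
reinforce("npu", state, good=True)
reinforce("gpu", state, good=False)

sims = {c: float(prototype(c) @ state) / D for c in acc}
print(max(sims, key=sims.get))  # npu
```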

&lt;p&gt;&lt;strong&gt;Transfer learning is natural:&lt;/strong&gt; Because HDC encodes &lt;em&gt;structure&lt;/em&gt;, not &lt;em&gt;weights&lt;/em&gt;, a codebook learned for one model can transfer to another. The routing patterns for "large matmul on hot GPU" are similar regardless of which model the matmul belongs to.&lt;/p&gt;

&lt;p&gt;The &lt;a href="https://github.com/hyperdimensional-computing/torchhd" rel="noopener noreferrer"&gt;torchhd library&lt;/a&gt; (&lt;code&gt;pip install torch-hd&lt;/code&gt;) implements all of this in PyTorch. The research bridge between HDC and the XDNA 2 systolic array is what I'm building.&lt;/p&gt;

&lt;h2&gt;
  
  
  What I'm not claiming
&lt;/h2&gt;

&lt;p&gt;This is a research direction, not a shipped product. HDC has shown strong results for classification tasks. Whether it works for generation (audio, text, images) is an open question. The quality bar for "sounds good" is much higher than "classifies correctly."&lt;/p&gt;

&lt;p&gt;But the hardware match is real. The XDNA 2 systolic array already does the math that HDC needs. Nobody has connected these two pieces on consumer APU hardware. That's the research spike.&lt;/p&gt;

&lt;h2&gt;
  
  
  Try it yourself
&lt;/h2&gt;

&lt;p&gt;If you're on Ryzen AI 300 hardware:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight shell"&gt;&lt;code&gt;pip &lt;span class="nb"&gt;install &lt;/span&gt;torch-hd
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;





&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight python"&gt;&lt;code&gt;&lt;span class="kn"&gt;import&lt;/span&gt; &lt;span class="n"&gt;torchhd&lt;/span&gt;

&lt;span class="c1"&gt;# Encode two concepts as 10,000-dimensional hypervectors
&lt;/span&gt;&lt;span class="n"&gt;A&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torchhd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;random&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# "GPU is hot"
&lt;/span&gt;&lt;span class="n"&gt;B&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torchhd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;random&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# "workload is large"
&lt;/span&gt;
&lt;span class="c1"&gt;# Bind them together (XOR-like)
&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torchhd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;bind&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;A&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;B&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;  &lt;span class="c1"&gt;# "GPU is hot AND workload is large"
&lt;/span&gt;
&lt;span class="c1"&gt;# Compare against stored patterns
&lt;/span&gt;&lt;span class="n"&gt;pattern_npu&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torchhd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;random&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;   &lt;span class="c1"&gt;# Known-good: route to NPU
&lt;/span&gt;&lt;span class="n"&gt;pattern_gpu&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torchhd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;random&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="mi"&gt;1&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="mi"&gt;10000&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;   &lt;span class="c1"&gt;# Known-good: route to GPU
&lt;/span&gt;
&lt;span class="c1"&gt;# Which pattern does the current state resemble most?
&lt;/span&gt;&lt;span class="n"&gt;sim_npu&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torchhd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cosine_similarity&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pattern_npu&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="n"&gt;sim_gpu&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;torchhd&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="nf"&gt;cosine_similarity&lt;/span&gt;&lt;span class="p"&gt;(&lt;/span&gt;&lt;span class="n"&gt;state&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt; &lt;span class="n"&gt;pattern_gpu&lt;/span&gt;&lt;span class="p"&gt;)&lt;/span&gt;
&lt;span class="c1"&gt;# → The higher similarity wins. Field resolution, not if/else.
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The full roadmap, architecture, and Phase 1-2 implementation (working three-processor dispatch with learned routing) are at:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;&lt;a href="https://github.com/Peterc3-dev/rag-race-router" rel="noopener noreferrer"&gt;github.com/Peterc3-dev/rag-race-router&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Previous posts in this series:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;a href="https://dev.to/peterc3dev/your-amd-apu-has-three-processors-why-does-ml-only-use-one-4ibi"&gt;Your AMD APU Has Three Processors. Why Does ML Only Use One?&lt;/a&gt;&lt;/li&gt;
&lt;li&gt;&lt;a href="https://dev.to/peterc3dev/i-got-all-three-processors-talking-to-each-other-on-my-amd-laptop-29k3"&gt;I Got All Three Processors Talking to Each Other&lt;/a&gt;&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;&lt;em&gt;I'm Peter Clemente (&lt;a href="https://github.com/Peterc3-dev" rel="noopener noreferrer"&gt;@Peterc3-dev&lt;/a&gt;). I build inference systems on AMD APUs running Linux. This project is part of CIN — a distributed inference network that treats every device as a node.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>amd</category>
      <category>machinelearning</category>
      <category>research</category>
      <category>opensource</category>
    </item>
    <item>
      <title>I Got All Three Processors Talking to Each Other on My AMD Laptop</title>
      <dc:creator>PeterC3.dev</dc:creator>
      <pubDate>Sun, 29 Mar 2026 14:31:40 +0000</pubDate>
      <link>https://dev.to/peterc3dev/i-got-all-three-processors-talking-to-each-other-on-my-amd-laptop-29k3</link>
      <guid>https://dev.to/peterc3dev/i-got-all-three-processors-talking-to-each-other-on-my-amd-laptop-29k3</guid>
      <description>&lt;p&gt;Last week I published &lt;a href="https://dev.to/peterc3/the-scheduler-that-learns-your-chip-building-tri-processor-inference-for-amd-ryzen-ai-300-5gci"&gt;an architecture&lt;/a&gt;. This week it runs.&lt;/p&gt;

&lt;p&gt;The R.A.G-Race-Router engine now dispatches real workloads across all three processors on my GPD Pocket 4 (AMD Ryzen AI 9 HX 370): CPU (Zen 5), iGPU (Radeon 890M via Vulkan), and NPU (XDNA 2 at 50 TOPS). Here's what actually happened when I wired it up.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Three-Processor Demo
&lt;/h2&gt;



&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Processors: CPU ok  GPU (Vulkan) ok  NPU (XDNA) ok
GPU Temp: 43C | VRAM: 1732/8192 MB | Pulse: READY

  tokenize  -&amp;gt; CPU  (7ms)     [lightweight, no dispatch overhead]
  embed     -&amp;gt; NPU  (282ms)   [embedding lookup, NPU efficient]
  matmul    -&amp;gt; GPU  (5ms)     [512x512 via Vulkan SPIR-V shader]
  attention -&amp;gt; GPU  (5ms)     [scaled dot-product, pulsed burst]
  normalize -&amp;gt; NPU  (5ms)     [RMS norm, NPU sweet spot]
  project   -&amp;gt; GPU  (6ms)     [linear projection, pulsed burst]
  decode    -&amp;gt; CPU  (6ms)     [greedy argmax, trivial]

Pipeline: 316ms total (16ms GPU, 287ms NPU, 13ms CPU)
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;Each operation is dispatched to the device the engine thinks is best, based on heuristics during the first few runs and learned routing rules after that. The personality database records every execution and gradually builds a profile of this specific chip.&lt;/p&gt;
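&lt;p&gt;Conceptually, the personality database reduces to "record every (op, device, latency) observation, then route each op to its best-known device." A minimal sketch of that idea, not the project's actual schema:&lt;/p&gt;

```python
from collections import defaultdict
from statistics import mean

# Hypothetical personality store: latency samples per (op, device) pair
history = defaultdict(list)

def record(op, device, ms):
    history[(op, device)].append(ms)

def best_device(op, devices=("cpu", "gpu", "npu")):
    # Route to the device with the best observed average latency;
    # fall back to a heuristic default before any measurements exist.
    seen = {d: history.get((op, d), []) for d in devices}
    seen = {d: xs for d, xs in seen.items() if xs}
    if not seen:
        return "cpu"
    return min(seen, key=lambda d: mean(seen[d]))

record("matmul", "cpu", 7.6)
record("matmul", "gpu", 5.0)
record("embed", "npu", 199.0)
print(best_device("matmul"))  # gpu
```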

&lt;h2&gt;
  
  
  The NPU Was Broken — We Fixed It
&lt;/h2&gt;

&lt;p&gt;The NPU on Strix Point refused to initialize. The kernel driver's SMU (System Management Unit) failed with &lt;code&gt;smu cmd 4 failed, 0xff&lt;/code&gt; on every boot. Three sessions of debugging later:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Root cause&lt;/strong&gt;: The driver calls SMU init &lt;em&gt;before&lt;/em&gt; loading firmware via PSP (Platform Security Processor). On Strix Point, the SMU doesn't respond until firmware is loaded. Classic init-order bug.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;The fix&lt;/strong&gt;: A three-line patch to the out-of-tree amdxdna driver that skips SMU when it fails, loads firmware via PSP anyway, and continues without power management. The NPU runs at default BIOS clocks.&lt;/p&gt;

&lt;p&gt;Result: Llama 3.2 1B at 40-46 tok/s prefill, 14-24 tok/s decode, running on the NPU via FastFlowLM. The patched driver loads automatically on boot via a systemd service.&lt;/p&gt;

&lt;h2&gt;
  
  
  Vulkan Is Still Faster Than ROCm on This GPU
&lt;/h2&gt;

&lt;p&gt;Updated benchmarks with the engine's Kompute integration:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Backend&lt;/th&gt;
&lt;th&gt;Workload&lt;/th&gt;
&lt;th&gt;Performance&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;CPU (NumPy/BLAS)&lt;/td&gt;
&lt;td&gt;512x512 matmul&lt;/td&gt;
&lt;td&gt;7.6ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPU (Vulkan/Kompute)&lt;/td&gt;
&lt;td&gt;512x512 matmul&lt;/td&gt;
&lt;td&gt;5.0ms&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;GPU (Vulkan/IREE)&lt;/td&gt;
&lt;td&gt;1024x1024 matmul&lt;/td&gt;
&lt;td&gt;1,085 GFLOPS&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;NPU (FLM)&lt;/td&gt;
&lt;td&gt;Llama 3.2 1B prefill&lt;/td&gt;
&lt;td&gt;40-46 tok/s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;The Vulkan path uses pre-compiled SPIR-V shaders (matmul, attention, fused add-scale) dispatched through Kompute 0.9.0. ROCm's &lt;code&gt;hipMallocManaged&lt;/code&gt; remains broken on gfx1150 — Vulkan accesses the full VRAM+GTT pool while HIP only sees the BIOS carveout.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Engine Learns Your Chip
&lt;/h2&gt;

&lt;p&gt;After 5 runs, the personality database encodes routing rules:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Hardware Personality (35 runs):
Operation        Best on  Avg (ms)   Confidence
tokenize         cpu      0.04       100%
embed            npu      199.04     100%
matmul           gpu      0.50       (learning)
attention        gpu      0.53       (learning)
decode           cpu      0.06       100%
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The system learns that embedding is best on NPU, tokenization belongs on CPU, and matrix ops go to GPU. When GPU temperature spikes, the dispatcher reroutes small ops to NPU or CPU. Every reroute is logged and fed back into the personality.&lt;/p&gt;

&lt;h2&gt;
  
  
  Thermal Stress Test
&lt;/h2&gt;

&lt;p&gt;I pushed the GPU for 30 seconds of continuous compute to test adaptive rerouting:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;438 operations dispatched&lt;/li&gt;
&lt;li&gt;54 reroutes (matmul -&amp;gt; CPU when GPU was busy)&lt;/li&gt;
&lt;li&gt;GPU temperature: 45C -&amp;gt; 50C (well within thermal budget)&lt;/li&gt;
&lt;li&gt;Distribution: 380 GPU, 53 NPU, 5 CPU&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The pulsed execution model (burst on GPU, check temperature, cooldown if needed) prevents thermal throttling. The engine's pulse controller adapts the burst/cooldown ratio based on real-time temperature readings from amdgpu_top.&lt;/p&gt;
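&lt;p&gt;The burst/check/cooldown loop can be sketched in a few lines. The threshold and the stand-in temperature reader below are invented for illustration; the real engine polls amdgpu_top:&lt;/p&gt;

```python
import random
import time

# Sketch of the pulsed execution model: burst on the GPU, check the
# temperature, back off when it runs warm. Threshold is illustrative.
TEMP_LIMIT_C = 70.0

def read_gpu_temp():
    # Stand-in for polling amdgpu_top / hwmon sensors
    return random.uniform(40.0, 80.0)

def pulsed_run(burst_fn, bursts=5, cooldown_s=0.01):
    done = 0
    for _ in range(bursts):
        if read_gpu_temp() > TEMP_LIMIT_C:
            time.sleep(cooldown_s)   # cooldown instead of throttling
            continue
        burst_fn()                   # one burst of GPU work
        done += 1
    return done

completed = pulsed_run(lambda: None)
print(completed)  # number of bursts that actually executed
```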

&lt;h2&gt;
  
  
  What's Next
&lt;/h2&gt;

&lt;p&gt;This is still pre-alpha. The dispatch overhead matters — for tiny operations, routing through the engine is slower than just running on CPU. The win is thermal management and sustained throughput for long-running workloads.&lt;/p&gt;

&lt;p&gt;Next steps:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Route MusicGen audio generation through the engine (text encoder on CPU, decoder on pulsed GPU, EnCodec on CPU)&lt;/li&gt;
&lt;li&gt;Reduce dispatch overhead for small ops (batch scheduling)&lt;/li&gt;
&lt;li&gt;IREE integration for compiled NPU kernels&lt;/li&gt;
&lt;li&gt;Upstream the amdxdna SMU bypass patch&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The code is at &lt;a href="https://github.com/Peterc3-dev/rag-race-router" rel="noopener noreferrer"&gt;Peterc3-dev/rag-race-router&lt;/a&gt;. MIT license.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;This project is part of CIN (Collaborative Intelligence Network), a distributed inference system spanning a ThinkCentre M70q hub and this GPD Pocket 4 mobile workstation, connected via Tailscale mesh.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>amd</category>
      <category>machinelearning</category>
      <category>linux</category>
      <category>opensource</category>
    </item>
    <item>
      <title>Your AMD APU Has Three Processors. Why Does ML Only Use One?</title>
      <dc:creator>PeterC3.dev</dc:creator>
      <pubDate>Sun, 29 Mar 2026 06:57:13 +0000</pubDate>
      <link>https://dev.to/peterc3dev/your-amd-apu-has-three-processors-why-does-ml-only-use-one-4ibi</link>
      <guid>https://dev.to/peterc3dev/your-amd-apu-has-three-processors-why-does-ml-only-use-one-4ibi</guid>
      <description>&lt;p&gt;I've been staring at my AMD Ryzen AI HX 370 for months thinking the same thing: this chip has three processors that share a memory bus, and every ML runtime ignores two of them.&lt;/p&gt;

&lt;p&gt;The CPU runs inference. The GPU sits there unless you explicitly set it up. The NPU — 50 TOPS of dedicated neural compute at 2 watts — does literally nothing unless you're on Windows blurring your webcam background.&lt;/p&gt;

&lt;p&gt;What if a runtime used all three? And what if it &lt;em&gt;learned&lt;/em&gt; the optimal split for your specific chip?&lt;/p&gt;

&lt;h2&gt;
  
  
  The hardware nobody's exploiting
&lt;/h2&gt;

&lt;p&gt;The Ryzen AI 300 series is a monolithic die. CPU (Zen 5), iGPU (RDNA 3.5, 16 CUs), and NPU (XDNA 2) share physical LPDDR5X through one memory controller. No dedicated VRAM. True unified memory architecture.&lt;/p&gt;

&lt;p&gt;The theoretical pipeline:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;NPU&lt;/strong&gt; handles efficient inference at 50 TOPS / 2W — your always-on workhorse&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;iGPU&lt;/strong&gt; handles flexible parallel compute — batch processing, larger models&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;CPU&lt;/strong&gt; orchestrates, preprocesses, and fills gaps&lt;/li&gt;
&lt;li&gt;A &lt;strong&gt;scheduler&lt;/strong&gt; (running on the NPU itself) learns which ops run best where on &lt;em&gt;your&lt;/em&gt; chip&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;After 47 generations, it tells you: &lt;em&gt;"guidance_scale=4.2 produces the cleanest output on this hardware."&lt;/em&gt; After a driver update: &lt;em&gt;"ROCm 7.3 improved GPU throughput 15% — redistributing layers."&lt;/em&gt;&lt;/p&gt;

&lt;h2&gt;
  
  
  I did the research. Here's what's actually possible today.
&lt;/h2&gt;

&lt;p&gt;I spent a week mapping every dependency, every driver, every research paper. The short version:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;What works (March 2026):&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;NPU inference via FastFlowLM: Llama 3.2 1B at ~60 tok/s, under 2 watts&lt;/li&gt;
&lt;li&gt;XDNA kernel driver mainlined in Linux 6.14&lt;/li&gt;
&lt;li&gt;iGPU inference via Vulkan llama.cpp (and it's 60% &lt;em&gt;faster&lt;/em&gt; than ROCm — more on that below)&lt;/li&gt;
&lt;li&gt;All three processors sharing physical memory&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;What doesn't:&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;ONNX Runtime's Vitis AI EP is completely broken on Linux&lt;/li&gt;
&lt;li&gt;
&lt;code&gt;hipMallocManaged&lt;/code&gt; returns "not supported" on the 890M&lt;/li&gt;
&lt;li&gt;No DMA-BUF bridge between the GPU and NPU drivers&lt;/li&gt;
&lt;li&gt;Nobody has run all three processors simultaneously for inference&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  The Vulkan surprise
&lt;/h2&gt;

&lt;p&gt;Here's something the ROCm community hasn't fully absorbed: &lt;strong&gt;Vulkan outperforms ROCm by ~60% for prompt processing on the Radeon 890M.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The reason is memory access. ROCm's &lt;code&gt;hipMalloc&lt;/code&gt; can only address the BIOS-configured VRAM carveout. On a 96GB system, that might be 48GB. Vulkan sees the entire pool — VRAM plus GTT, 80+ GB — via &lt;code&gt;VK_MEMORY_PROPERTY_HOST_VISIBLE_BIT&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;On a memory-bandwidth-bound workload over a 120 GB/s LPDDR5X bus, that gap is decisive. For this project, Vulkan is the iGPU backend.&lt;/p&gt;
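&lt;p&gt;A back-of-envelope number makes the bandwidth bound concrete: during decode, each generated token streams roughly the whole weight file through the bus, so bandwidth divided by model size caps tokens per second:&lt;/p&gt;

```python
# Rough decode-speed ceiling for a bandwidth-bound LLM
bandwidth_gb_s = 120              # LPDDR5X bus on this APU
model_gb = 4.0                    # e.g. an 8B model at roughly 4-bit quantization
print(bandwidth_gb_s / model_gb)  # upper bound in tokens per second: 30.0
```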

&lt;h2&gt;
  
  
  The NPU-as-scheduler concept
&lt;/h2&gt;

&lt;p&gt;This is the part that has no prior art in the literature. I checked.&lt;/p&gt;

&lt;p&gt;The idea: dedicate a few NPU columns to running a tiny scheduling neural network (~500 bytes, based on CoDL's latency predictor architecture) that monitors CPU and GPU utilization, thermal state, and inference latency — then dynamically redistributes operators across all three processors.&lt;/p&gt;

&lt;p&gt;The NPU is perfect for this because:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;It runs at &amp;lt;2W, always-on without thermal impact&lt;/li&gt;
&lt;li&gt;XDNA 2 supports dynamic spatial partitioning at column boundaries&lt;/li&gt;
&lt;li&gt;The remaining NPU columns still handle inference workloads&lt;/li&gt;
&lt;li&gt;It's literally a neural processor running a neural scheduling policy&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The closest analogy is SmartNIC-as-orchestrator in distributed systems (Wave, Conspirator, RingLeader) — an auxiliary processor dedicated to scheduling decisions. Nobody's applied this pattern to NPUs.&lt;/p&gt;

&lt;h2&gt;
  
  
  First of its kind
&lt;/h2&gt;

&lt;p&gt;I reviewed 25+ papers, AMD's own 2025 scheduling research, and every major heterogeneous inference project I could find. Nobody has built this.&lt;/p&gt;

&lt;p&gt;CPU+GPU co-execution is well-studied (CoDL, SparOA). NPU+GPU scheduling exists in limited forms. But a self-optimizing runtime that dynamically partitions operators across all three processors on a consumer APU? No prior implementation. AMD's Karami et al. (2025) characterize the &lt;em&gt;problem&lt;/em&gt; — a combinatorial scheduling search space of O(2^125) — but don't build a runtime that solves it.&lt;/p&gt;

&lt;p&gt;To be clear: I've architected this and validated feasibility, not shipped a running system. But the architecture itself, the feasibility study, and the research synthesis represent the first open-source attempt at this category of runtime.&lt;/p&gt;

&lt;h2&gt;
  
  
  Six novel contributions
&lt;/h2&gt;

&lt;p&gt;From the full literature review, these have no existing implementation:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;strong&gt;NPU-as-scheduling-agent&lt;/strong&gt; for CPU+GPU workload orchestration&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Persistent hardware personality&lt;/strong&gt; — an evolving model of your chip's specific behavior over weeks/months&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Three-processor dynamic operator placement&lt;/strong&gt; on a single SoC (CPU+GPU is studied; all three is not)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Cross-model transfer learning&lt;/strong&gt; for on-device scheduling (learning from Model A improves scheduling of Model B)&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Vulkan+XRT memory bridge&lt;/strong&gt; — combining Vulkan's superior unified memory access with XRT buffer objects via CPU-mediated sharing&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;NPU-bookended assembly line&lt;/strong&gt; — NPU dispatches at the start, assembles at the end; CPU and GPU are decoupled async producers. 1000:1 speed ratio makes scheduling overhead effectively zero&lt;/li&gt;
&lt;/ol&gt;

&lt;h2&gt;
  
  
  What's next
&lt;/h2&gt;

&lt;p&gt;I'm calling the project &lt;strong&gt;R.A.G-Race-Router&lt;/strong&gt; (an Adaptive Tri-Processor Inference Runtime). The runtime treats the three processors as an assembly line: CPU and GPU are asynchronous production belts, and the NPU bookends the pipeline — dispatching work at the start, assembling output at the end. At 50 TOPS, the NPU evaluates scheduling decisions in microseconds while CPU/GPU compute takes milliseconds, so from the pipeline's point of view it is effectively in two places at once. After a few runs, it encodes the dispatch pattern as lightweight rules that auto-execute with near-zero overhead, only re-engaging when something changes.&lt;/p&gt;

&lt;p&gt;The full feasibility study, architecture, literature review, and Phase 1 build instructions are here:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;→ &lt;a href="https://github.com/Peterc3-dev/rag-race-router" rel="noopener noreferrer"&gt;github.com/Peterc3-dev/rag-race-router&lt;/a&gt;&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;Phase 1 is proving three-processor data flow on a Ryzen AI 300 under CachyOS. The immediate step is getting FastFlowLM running on the NPU and benchmarking the three-way pipeline.&lt;/p&gt;

&lt;p&gt;This is pre-alpha. No code yet — just architecture, validated feasibility, and a clear build path. If you're running Ryzen AI 300 on Linux and this resonates, I'd love to hear from you.&lt;/p&gt;

&lt;h2&gt;
  
  
  The bigger picture
&lt;/h2&gt;

&lt;p&gt;AMD shipped hardware that could redefine edge inference. The silicon is there. The drivers are (mostly) there. What's missing is a runtime that treats the whole SoC as a unified inference machine instead of three separate devices that happen to share a bus.&lt;/p&gt;

&lt;p&gt;Every chip is slightly different. Thermal characteristics, silicon lottery, memory controller behavior, driver versions. A runtime that learns &lt;em&gt;your&lt;/em&gt; chip's personality isn't an optimization — it's a new category.&lt;/p&gt;

&lt;p&gt;The models are coming. The question is whether any runtime will know how to actually use the hardware it's running on.&lt;/p&gt;




&lt;p&gt;&lt;em&gt;I'm Peter Clemente (&lt;a href="https://github.com/Peterc3-dev" rel="noopener noreferrer"&gt;@Peterc3-dev&lt;/a&gt;). I build systems on Linux. This project is part of a broader architecture called CIN — a distributed inference network that treats every device as a node.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>amd</category>
      <category>machinelearning</category>
      <category>linux</category>
      <category>opensource</category>
    </item>
  </channel>
</rss>
