I've been building a three-processor inference runtime for AMD Ryzen AI 300 APUs — CPU, GPU, and NPU working together. Last night, while thinking about how the NPU's systolic array actually works at the hardware level, something clicked.
The observation
Every ML model today does the same thing: billions of multiply-accumulate operations. Multiply this matrix by that matrix. Add bias. Apply activation function. Repeat ten billion times.
But that's not how I solve problems. I don't compute the answer — I feel which answer fits. Like grabbing a handful of Magnetix (those magnetic construction toys) and shaking them until the pieces snap into the right configuration. You don't calculate the geometry. The magnetic fields resolve it for you.
What if compute worked the same way?
Hyperdimensional Computing already exists
This isn't science fiction. It's an active research field called Hyperdimensional Computing (HDC) or Vector Symbolic Architectures (VSA), and it works exactly like the magnet analogy:
- Data is encoded as 10,000-dimensional vectors (hypervectors). Not a single number. Not a small array. A point in a vast geometric space.
- Operations are XOR (bind two concepts together), majority vote (combine multiple vectors), and cosine similarity (which stored pattern is closest to this input?).
- Answers emerge from geometric resonance — the system finds the nearest match in high-dimensional space, like a magnet snapping to its complement.
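The three primitives are simple enough to sketch without any library. A minimal illustration in plain NumPy, assuming bipolar {-1, +1} hypervectors (one common HDC encoding; the function names here are mine, not a standard API):

```python
import numpy as np

rng = np.random.default_rng(0)
D = 10_000

def random_hv():
    # Random bipolar hypervector: each dimension is -1 or +1
    return rng.choice([-1, 1], size=D)

def bind(a, b):
    # Binding: elementwise multiply (equivalent to XOR for bipolar vectors)
    return a * b

def bundle(*vectors):
    # Bundling: elementwise majority vote (odd count avoids ties)
    return np.sign(np.sum(vectors, axis=0))

def cosine(a, b):
    return a @ b / (np.linalg.norm(a) * np.linalg.norm(b))

a, b, c = random_hv(), random_hv(), random_hv()
bound = bind(a, b)
print(round(cosine(bound, a), 2))    # near 0: a bound pair resembles neither input
print(cosine(bind(bound, b), a))     # 1.0: binding is its own inverse
merged = bundle(a, b, c)
print(round(cosine(merged, a), 2))   # ~0.5: a bundle stays similar to each input
```

The inverse property is what makes binding useful: you can pack "role AND value" into one vector and later unbind to recover either part.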
Key properties that matter:
Far more error-tolerant than conventional neural networks. When the representation is distributed across 10,000 dimensions, noise in any individual dimension washes out. Flip 10% of the vector's dimensions and nearest-neighbor lookup still returns the right answer with overwhelming probability.
No training loop. You don't need gradient descent. Codebook entries are constructed — you encode a known-good state as a hypervector and store it. To classify, you find the nearest stored vector. No backpropagation, no loss functions, no epochs.
Incremental learning without forgetting. New patterns are added by bundling (majority vote). The system doesn't forget old patterns when it learns new ones. No catastrophic forgetting — the bane of neural networks.
Massively parallel. All similarity comparisons happen simultaneously. There's no sequential dependency between checking pattern A and pattern B.
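The error-tolerance claim is easy to check empirically. A quick NumPy sketch (random codebook, illustrative only): flip 10% of a stored entry's dimensions and see whether nearest-neighbor search still finds it.

```python
import numpy as np

rng = np.random.default_rng(1)
D, N = 10_000, 1_000
codebook = rng.choice([-1, 1], size=(N, D))   # 1,000 stored patterns

query = codebook[42].copy()
flip = rng.choice(D, size=D // 10, replace=False)
query[flip] *= -1                             # corrupt 10% of the dimensions

sims = codebook @ query                       # similarity to all entries at once
print(int(np.argmax(sims)))                   # → 42: still snaps to the right entry
```

The corrupted query still scores 8,000 against its own entry, while random entries score within a few hundred of zero, so the match is never in doubt.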
Why XDNA 2's systolic array is a natural fit
Here's where it gets interesting for AMD hardware.
The XDNA 2 NPU in Ryzen AI 300 is a systolic array — a grid of identical cells, each performing multiply-accumulate, passing data to neighbors in rhythmic pulses. Two perpendicular data flows cross at each cell. It's designed for matrix multiplication.
HDC's core operation — "which stored pattern is most similar to this input?" — IS a matrix multiply:
similarity = input_vector · codebook_entry
           = Σ input[i] × entry[i]   for i = 1 … 10,000
That's a dot product. The systolic array performs thousands of multiply-accumulates per cycle, so at 50 TOPS the NPU can evaluate a 10,000-dimensional similarity search against a codebook of 1,000 entries (about 20 million operations) in well under a microsecond. At 2 watts.
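Put differently, the entire codebook search collapses into one matrix-vector product, which is exactly the shape of work a systolic array is built for. A NumPy sketch of that single operation, with the sizes from the example above:

```python
import numpy as np

rng = np.random.default_rng(2)
# 1,000 stored patterns x 10,000 dimensions, plus one query state
codebook = rng.choice([-1, 1], size=(1_000, 10_000)).astype(np.float32)
state = rng.choice([-1, 1], size=10_000).astype(np.float32)

# All 1,000 similarity scores in a single matrix-vector product --
# the same GEMV shape the systolic array accelerates
scores = codebook @ state
best = int(np.argmax(scores))   # index of the nearest stored pattern
```

On the NPU this would be one (1,000 × 10,000) matrix against a 10,000-element vector; NumPy here just stands in for the hardware operation.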
The hardware is already there. The software abstraction connecting HDC to the XDNA 2 systolic array doesn't exist yet.
What this means for inference
Current flow:
Input → billions of multiply-accumulate ops → Output
(brute-force arithmetic, sequential)
HDC flow:
Input → encode as hypervector → similarity search across learned field → Output snaps into place
(geometric resolution, parallel, fault-tolerant)
The computation is the same hardware operation (matrix multiply for similarity), but the representation is fundamentally different. Information is distributed holographically — every dimension contains partial information about every concept. Answers emerge from the geometry of the space, not from chaining arithmetic.
Where I'm applying this
In R.A.G-Race-Router, I'm building a runtime that dispatches ML operations across CPU, GPU, and NPU with learned routing. Currently, the dispatcher uses a small neural network trained via policy gradient — standard RL.
The next step (Phase 3 in the roadmap): replace the arithmetic dispatcher with an HDC routing layer.
Instead of: if gpu_temp > 75 and op_size > threshold: route_to_npu()
This: encode the current system state (GPU temp, workload type, memory pressure, NPU availability) as a hypervector, then find the nearest match in a codebook of known-good routing configurations. The optimal routing snaps into place via field resolution, not branching logic.
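Here is one way such an encoding could look. This is a sketch of the general role-filler pattern from the HDC literature, not the project's actual scheme; the feature names and the three-level quantization are my own placeholders.

```python
import numpy as np

rng = np.random.default_rng(3)
D = 10_000
hv = lambda: rng.choice([-1, 1], size=D)

# One random "role" hypervector per feature, one per quantized level
roles = {f: hv() for f in ("gpu_temp", "op_size", "mem_pressure", "npu_free")}
levels = {l: hv() for l in ("low", "mid", "high")}
tiebreak = hv()  # fixed vector to break majority-vote ties

def encode_state(readings):
    # Bind each feature's role to its level, then bundle the pairs
    pairs = [roles[f] * levels[l] for f, l in readings.items()]
    total = np.sum(pairs, axis=0)
    # An even number of pairs can tie at 0; resolve with the fixed vector
    return np.sign(total + 0.5 * tiebreak).astype(int)

state = encode_state({"gpu_temp": "high", "op_size": "high",
                      "mem_pressure": "low", "npu_free": "high"})
```

The resulting `state` is a single bipolar hypervector that can be compared against the routing codebook with one similarity search.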
Phase 4 goes further — using HDC as an inference primitive itself. Not just for routing decisions, but for the actual model inference. The answer self-assembles from field dynamics.
Can you train this?
Yes, but differently than neural networks.
Constructive learning: You don't train HDC with gradient descent. You construct codebook entries from observed data. See a good routing configuration? Encode it as a hypervector. Store it. Next time a similar state occurs, the system snaps to it.
Reinforcement from experience: Run inference, measure the outcome (latency, quality, power draw), encode the result. Good outcomes strengthen their codebook entries (the hypervector gets bundled with more examples). Bad outcomes weaken theirs. The codebook evolves over time — not through backprop, but through accumulation.
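A minimal sketch of what that accumulation could look like (my reading of the scheme above, not the project's code; the route names and function names are placeholders): keep an integer accumulator per route, add states that worked, subtract states that didn't, and derive each bipolar prototype by majority vote.

```python
import numpy as np

rng = np.random.default_rng(4)
D = 10_000
accumulators = {r: np.zeros(D, dtype=np.int64) for r in ("cpu", "gpu", "npu")}

def record_outcome(route, state_hv, good):
    # Good outcomes bundle the state into the route's entry;
    # bad outcomes subtract it, weakening the entry over time
    accumulators[route] += state_hv if good else -state_hv

def prototype(route):
    # Bipolar prototype = majority vote over accumulated experience
    return np.sign(accumulators[route])

def choose_route(state_hv):
    scores = {r: int(prototype(r) @ state_hv) for r in accumulators}
    return max(scores, key=scores.get)

# After one good NPU outcome, a similar state routes to the NPU
seen = rng.choice([-1, 1], size=D)
record_outcome("npu", seen, good=True)
print(choose_route(seen))  # → npu
```

No gradients anywhere: learning is integer addition, and the codebook can be updated online between inferences.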
Transfer learning is natural: Because HDC encodes structure, not weights, a codebook learned for one model can transfer to another. The routing patterns for "large matmul on hot GPU" are similar regardless of which model the matmul belongs to.
The torchhd library (pip install torch-hd) implements all of this in PyTorch. The research bridge between HDC and the XDNA 2 systolic array is what I'm building.
What I'm not claiming
This is a research direction, not a shipped product. HDC has shown strong results for classification tasks. Whether it works for generation (audio, text, images) is an open question. The quality bar for "sounds good" is much higher than "classifies correctly."
But the hardware match is real. The XDNA 2 systolic array already does the math that HDC needs. Nobody has connected these two pieces on consumer APU hardware. That's the research spike.
Try it yourself
If you're on Ryzen AI 300 hardware:
pip install torch-hd

import torchhd

# Encode two concepts as 10,000-dimensional hypervectors
A = torchhd.random(1, 10000)  # "GPU is hot"
B = torchhd.random(1, 10000)  # "workload is large"

# Bind them together (XOR-like; elementwise multiply for bipolar vectors)
state = torchhd.bind(A, B)  # "GPU is hot AND workload is large"

# Codebook entries are constructed from observed states, not drawn at
# random. Pretend an earlier hot-GPU/large-workload state ran best on
# the NPU:
pattern_npu = torchhd.bind(A, B)        # Known-good: route to NPU
pattern_gpu = torchhd.random(1, 10000)  # Known-good: route to GPU

# Which stored pattern does the current state resemble most?
sim_npu = torchhd.cosine_similarity(state, pattern_npu)  # ≈ 1
sim_gpu = torchhd.cosine_similarity(state, pattern_gpu)  # ≈ 0
# → The higher similarity wins. Field resolution, not if/else.
The full roadmap, architecture, and Phase 1-2 implementation (working three-processor dispatch with learned routing) are at:
github.com/Peterc3-dev/rag-race-router
Previous posts in this series:
- Your AMD APU Has Three Processors. Why Does ML Only Use One?
- I Got All Three Processors Talking to Each Other
I'm Peter Clemente (@Peterc3-dev). I build inference systems on AMD APUs running Linux. This project is part of CIN — a distributed inference network that treats every device as a node.