DEV Community

Catalyst

How I Built Two Generations of Neuromorphic Processor From Scratch

Your brain runs on about 20 watts. It processes visual scenes, generates speech, and maintains balance — all simultaneously, all in real time. The best GPU clusters in the world burn megawatts to approximate what 86 billion neurons do effortlessly.

Neuromorphic processors try to close that gap. Instead of shuttling numbers through ALUs, they mimic biology: neurons fire discrete spikes, synapses carry weighted connections, and computation only happens when something actually changes. Intel's Loihi chip demonstrated this could work at scale. But Loihi is proprietary, and access requires going through Intel's cloud service.

So I built my own. Two generations. From scratch. Solo. As a university student.

Catalyst N1: The Foundation

N1 was the first generation, targeting feature parity with Intel's Loihi 1. It's a 128-core processor where each core contains 1,024 CUBA (current-based) leaky integrate-and-fire neurons and 131,072 synapses in compressed sparse row (CSR) format.

The headline features:

  • Programmable microcode learning engine — 16 registers, 14 opcodes. After every timestep, each core runs a small program that implements STDP, three-factor reward learning, homeostatic normalization, or any custom rule — no RTL changes needed.
  • Dendritic compartment trees — 4 compartments per neuron with configurable join operations (ADD, ABS_MAX, OR, PASS). Dendrites do local nonlinear processing before signals reach the soma.
  • 8-bit graded spikes — neurons carry intensity information, not just fire/no-fire. This actually exceeds Loihi 1, which doesn't have graded spikes at all — Intel only added them in Loihi 2.
  • 24-bit state precision — one bit more than Loihi 1's 23-bit, with RAZ (round-away-from-zero) arithmetic that prevents neurons from getting stuck at non-resting potentials.
  • Triple RV32IMF RISC-V cluster — three embedded processors with IEEE 754 FPU, hardware breakpoints, and shared mailbox for supervisory control.
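
To make the learning-engine idea concrete, here is a pure-Python sketch of a pair-based STDP rule driven by decaying pre/post spike traces — the kind of rule the per-core microcode could express. The function name, constants, and trace formulation are illustrative, not the actual N1 instruction set.

```python
# Illustrative pair-based STDP with exponentially decaying traces.
# Names and constants are hypothetical, not the N1 microcode ISA.

def stdp_step(w, pre_spike, post_spike, x_pre, y_post,
              a_plus=0.01, a_minus=0.012, tau=20.0,
              w_min=0.0, w_max=255.0):
    # Decay the eligibility traces, then bump them on a spike
    decay = 1.0 - 1.0 / tau
    x_pre = x_pre * decay + (1.0 if pre_spike else 0.0)
    y_post = y_post * decay + (1.0 if post_spike else 0.0)

    # Potentiate on a post spike (pre-before-post pairing),
    # depress on a pre spike (post-before-pre pairing)
    if post_spike:
        w += a_plus * x_pre
    if pre_spike:
        w -= a_minus * y_post

    return max(w_min, min(w_max, w)), x_pre, y_post
```

Because the rule only reads local traces and the synapse's own weight, it maps naturally onto a per-core program that sweeps the synapse array once per timestep.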

N1 was validated through 25 RTL testbenches covering 98 test scenarios with zero failures. The SDK at this stage had 168 tests across 14 Python modules.

Catalyst N2: Programmable Neurons

If N1 was about matching Loihi 1, N2 was about the same architectural leap Intel made from Loihi 1 to Loihi 2: making the neuron programmable.

In N1, every neuron runs the same hardcoded CUBA LIF computation. Functional, but limiting — you can't do bursting, adaptation, oscillation, or graded error coding without changing the RTL.

N2 replaces the fixed datapath with a fetch-execute microcode engine. Each neuron runs its own program from instruction SRAM. A per-neuron program offset register means different neurons in the same core can run different programs. The register file (R0-R15) is loaded from neuron parameter SRAMs each timestep, and the instruction set includes arithmetic, shifts, min/max, conditional skips, and two spike emission modes (HALT for threshold-based, EMIT for forced payload).
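
A minimal sketch of what such a fetch-execute engine does, written as a Python interpreter over instruction tuples. The opcode names, encodings, and the example program are illustrative — they show the mechanism (per-neuron program offset, register file, conditional skip, halt-with-spike), not N2's actual ISA.

```python
# Toy fetch-execute interpreter for a neuron microcode engine.
# Opcodes and encodings are illustrative, not the real N2 ISA.

def run_neuron_program(program, regs, offset=0):
    """Execute one timestep of a neuron program; return (regs, spiked)."""
    pc = offset                      # per-neuron program offset
    while pc < len(program):
        op, dst, a, b = program[pc]
        pc += 1
        if op == "ADD":
            regs[dst] = regs[a] + regs[b]
        elif op == "SUB":
            regs[dst] = regs[a] - regs[b]
        elif op == "SHR":            # shift right by immediate b (leak)
            regs[dst] = regs[a] >> b
        elif op == "SKIP_LT":        # skip next instruction if regs[a] < regs[b]
            if regs[a] < regs[b]:
                pc += 1
        elif op == "HALT_SPIKE":     # threshold-style spike emission
            return regs, True
        elif op == "HALT":
            return regs, False
    return regs, False

# A tiny LIF-like program: v += I; v -= v >> 4 (leak); spike if v >= theta
# Register convention (illustrative): R0 = v, R1 = theta, R2 = I, R3 = scratch
lif_program = [
    ("ADD",        0, 0, 2),
    ("SHR",        3, 0, 4),
    ("SUB",        0, 0, 3),
    ("SKIP_LT",    0, 0, 1),   # if v < theta, skip the spike
    ("HALT_SPIKE", 0, 0, 0),
    ("HALT",       0, 0, 0),
]
```

Two neurons in the same core pointing at different offsets into instruction SRAM would simply start `pc` at different values — that one register is what makes heterogeneous cores possible.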

This is the same shift that happened in graphics — from fixed-function pixel pipelines to programmable shaders. Once the neuron is programmable, the hardware becomes a platform rather than a fixed implementation.

Five Neuron Models

N2 ships with five neuron models, all implemented as microcode programs:

  1. CUBA LIF — bit-identical to N1's fixed path. The microcode program reproduces the exact same spike trains as the hardcoded datapath.
  2. Izhikevich — two-variable quadratic model with four presets (regular spiking, intrinsic bursting, chattering, fast spiking). Uses MUL_SHIFT for the v²/2^s quadratic term.
  3. Adaptive LIF — adds a slow adaptation current that accumulates on spikes and decays exponentially. Produces spike-frequency adaptation.
  4. Sigma-Delta — maintains a running prediction of its input and emits the prediction error as a spike payload via EMIT. Achieves temporal sparsity for slowly varying signals.
  5. Resonate-and-Fire — damped oscillator that fires only when driven at its resonant frequency. No spectral computation needed.
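
As a flavour of what these models compute, here is a float-valued sketch of the adaptive LIF dynamics (model 3): a slow adaptation current jumps on every spike and decays between them, so a constant input yields progressively longer inter-spike intervals. The constants are illustrative, not the shipped microcode parameters, and the real implementation runs in fixed point.

```python
# Illustrative adaptive LIF: spike-frequency adaptation from a slow
# adaptation current w. Constants are hypothetical, float-valued.

def adaptive_lif(I, steps, tau_v=10.0, tau_w=100.0, beta=0.5, theta=1.0):
    v, w, spikes = 0.0, 0.0, []
    for t in range(steps):
        v += (I - v - w) / tau_v   # membrane integrates input minus adaptation
        w -= w / tau_w             # adaptation current decays slowly
        if v >= theta:
            spikes.append(t)
            v = 0.0                # reset membrane
            w += beta              # adaptation jumps on each spike
    return spikes
```

Driving this with a constant input shows the signature behaviour: early inter-spike intervals are short, later ones stretch out as `w` accumulates.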

Everything Else N2 Adds

  • 4 graded spike payload formats (0/8/16/24-bit) — up from 8-bit only in N1
  • Variable-precision weight packing (1/2/4/8/16-bit) — 16x memory compression at 1-bit. Loihi 2 only goes to 8-bit; N2's 9-16 bit range is useful for networks requiring higher precision.
  • 5 spike traces (x1, x2, y1, y2, y3) — up from 2 in N1. Enables triplet STDP (Pfister & Gerstner 2006) and complex eligibility traces.
  • Convolutional synapse encoding — stores weight kernels once per group; 2-3x memory reduction for CNN topologies.
  • Per-synapse-group plasticity enable — 30-70% learning phase speedup in mixed fixed/plastic networks.
  • Persistent reward traces with exponential decay — enables temporal credit assignment for reinforcement learning.
  • Homeostatic threshold plasticity — epoch-based proportional error rule, prevents firing rate drift in recurrent networks.
  • Full observability — 3 performance counters, 25-variable state probes per neuron, 64-deep trace FIFO, and energy metering.
  • Hardware-accurate simulation defaults — 24-bit fixed-point arithmetic, strict SRAM pool depth limits matching RTL.
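
To illustrate where the 16x figure for 1-bit weights comes from, here is one way to pack n-bit weights densely into a byte stream. The little-endian bit layout is an assumption for illustration; the actual SRAM word format will differ.

```python
# Sketch of variable-precision weight packing: n-bit weights packed
# little-endian into bytes. Layout is illustrative, not the RTL format.

def pack_weights(weights, bits):
    assert bits in (1, 2, 4, 8, 16)
    assert all(0 <= w < (1 << bits) for w in weights)
    acc, nacc, out = 0, 0, bytearray()
    for w in weights:
        acc |= w << nacc
        nacc += bits
        while nacc >= 8:
            out.append(acc & 0xFF)
            acc >>= 8
            nacc -= 8
    if nacc:
        out.append(acc & 0xFF)
    return bytes(out)

def unpack_weights(data, bits, count):
    acc, nacc, out = 0, 0, []
    it = iter(data)
    for _ in range(count):
        while nacc < bits:
            acc |= next(it) << nacc
            nacc += 8
        out.append(acc & ((1 << bits) - 1))
        acc >>= bits
        nacc -= bits
    return out
```

Eight 1-bit weights fit in a single byte where 8-bit weights would take eight — hence 16x relative to a 16-bit baseline.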

FPGA Validation

N2 was physically validated on an AWS F2 instance (Xilinx VU47P):

  • 16-core instance at 62.5 MHz neuromorphic clock / 250 MHz PCIe
  • 28/28 integration tests passing
  • 9 RTL-level tests generating 163,000+ spikes with zero mismatches
  • Dual-clock CDC with gray-code async FIFOs
  • ~8,690 timesteps/second throughput

BRAM is the binding constraint — 56% aggregate utilization for 16 cores. The full 128-core design is validated in simulation but would need a larger device or multi-FPGA partitioning.

85.9% on Spiking Heidelberg Digits

To validate the full pipeline, I trained a recurrent SNN on the Spiking Heidelberg Digits (SHD) dataset — 10,420 spoken digit recordings encoded as 700-channel cochlea spike trains.

Architecture: 700 input → 768 recurrent hidden → 20 output. Training uses surrogate gradients (fast sigmoid) with AdamW. After quantizing weights to 16 bits for hardware deployment, accuracy drops only 0.4% — from 85.9% to 85.4%. This surpasses published baselines from Cramer et al. (83.2%) and Zenke & Vogels (83.4%).
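
The surrogate-gradient trick is worth spelling out: the forward pass keeps the hard spike threshold, while the backward pass substitutes a smooth pseudo-derivative for the step function's zero-almost-everywhere gradient. Below is a minimal sketch of the fast-sigmoid variant; the steepness `beta` is illustrative, and in a framework like PyTorch this forward/backward pair would be wrapped in a custom autograd function.

```python
# Sketch of the surrogate-gradient trick with a fast-sigmoid
# pseudo-derivative. beta is an illustrative steepness constant.

def spike_forward(v, theta=1.0):
    # Hard threshold used in the forward pass
    return 1.0 if v >= theta else 0.0

def spike_backward(v, theta=1.0, beta=10.0):
    # Fast-sigmoid pseudo-derivative: peaks at threshold,
    # decays smoothly with distance |v - theta|
    return 1.0 / (1.0 + beta * abs(v - theta)) ** 2
```

Because the pseudo-derivative is largest for neurons near threshold, gradient updates concentrate on the neurons whose spiking decision could plausibly flip.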

The SDK: 3,091 Tests

The SDK grew 18x between N1 and N2:

|                  | N1   | N2                        |
|------------------|------|---------------------------|
| Test cases       | 168  | 3,091                     |
| Python modules   | 14   | 88                        |
| Neuron models    | 1    | 5                         |
| Synapse formats  | 3    | 4                         |
| Weight precisions| 1    | 5                         |
| Features         | —    | 155 (152 FULL, 3 HW_ONLY) |
| Lines of Python  | ~8K  | ~52K                      |

Three backends (CPU cycle-accurate, GPU via PyTorch, FPGA) with the same deploy/step/get_result API. The GPU simulator achieves 100-1000x speedup over CPU.
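
The value of a shared API is that swapping backends is a one-line change. Here is a schematic of the idea with a toy CPU backend — the class and method names beyond `deploy`/`step`/`get_result` are hypothetical, not the Catalyst SDK's actual classes.

```python
# Schematic "one API, many backends" pattern. Only deploy/step/get_result
# come from the article; everything else here is illustrative.

class CPUBackend:
    def deploy(self, network):
        pops = network["populations"]
        self.thresholds = [p["params"]["threshold"] for p in pops]
        self.v = [0] * len(pops)
        self.spike_counts = [0] * len(pops)

    def step(self, inputs):
        # One timestep: integrate input, fire and reset on threshold
        for i, (inc, theta) in enumerate(zip(inputs, self.thresholds)):
            self.v[i] += inc
            if self.v[i] >= theta:
                self.spike_counts[i] += 1
                self.v[i] = 0

    def get_result(self):
        return list(self.spike_counts)

def run(backend, network, inputs, timesteps):
    backend.deploy(network)
    for _ in range(timesteps):
        backend.step(inputs)
    return backend.get_result()
```

A GPU or FPGA backend exposing the same three methods would drop into `run` unchanged — which is also what makes cycle-accurate CPU results directly comparable with hardware runs.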

Cloud API

Don't want to install anything? Use Catalyst Cloud:

```bash
pip install catalyst-cloud
```

```python
from catalyst_cloud import CatalystClient

client = CatalystClient(api_key="your_key")

network = {
    "populations": [
        {"name": "input", "n": 100, "params": {"threshold": 1000}},
        {"name": "output", "n": 10, "params": {"threshold": 600}}
    ],
    "connections": [
        {"from": "input", "to": "output", "weight": 500, "probability": 0.3}
    ]
}

job = client.submit(network, timesteps=1000)
result = job.wait()
print(result.spike_counts)
```

Free tier for research. No credit card needed.

Try It

Licensed under BSL 1.1 — source-available, free for research, commercial use requires a paid licence.

238 development phases. Two processors. 3,091 tests. Built by one person at the University of Aberdeen.

If you're working on SNNs, neuromorphic computing, or alternative computing projects, I'd love to hear from you.
