SMGP: Building an AI That Actually Remembers What You Told It
I spent a few months combining spectral graph theory, hyperdimensional computing, and a custom FPGA to solve the three things about LLMs that drive me crazy.
Large language models are impressive in a lot of ways. But after working with them for a while, three things started to really bother me:
- They forget everything between conversations. You can spend an hour giving context, and the next session it’s gone.
- The attention mechanism scales quadratically. Double your context, quadruple your compute. That’s why long documents get expensive fast.
- They make stuff up. Confidently. And there’s no built-in way to check if what they’re saying is actually true.
I wanted to see if there was a fundamentally different approach. Not patching transformers, not scaling them bigger. Something built from the ground up where memory, attention, and truthfulness aren’t afterthoughts.
That’s what SMGP (Spectral Memory Graph Processor) tries to be. It stores knowledge in a persistent graph encoded with hyperdimensional vectors, does attention in O(N log N) using graph Fourier analysis, and verifies every factual claim by checking if a path exists in the graph. If there’s no path, the claim doesn’t get made.
There’s also a full FPGA accelerator that offloads the heavy compute – but I’ll get to that.
How It’s Put Together
Everything centers on what I’m calling a Spectral Memory Graph. It’s a directed, typed property graph where:
· Each fact, concept, or word is a node. Nodes are addressed by 10,000-dimensional hyperdimensional bipolar vectors (+1 or -1).
· Relationships are typed edges, also encoded with HD vectors.
· The whole graph gets treated as a discrete manifold. All the signal processing happens in the spectral domain of the graph Laplacian.
The rough architecture looks like this:
+=============================================================+
| Application Layer |
| LLM | Knowledge Bases | Reasoning Agents | Chatbots |
+---------------------------+---------------------------------+
|
+=============================================================+
| Integration Layer |
| HuggingFace | LangChain | FastAPI | CLI | Python HAL |
+---------------------------+---------------------------------+
|
+=============================================================+
| Reasoning Layer |
| ClaimVerifier | NeuroSymbolicPlanner |
| DPO Rewrite Search |
+---------------------------+---------------------------------+
|
+=============================================================+
| Memory & Attention Layer |
| MemoryStore | AssociativeMemory | MemoryLifecycle |
| SpectralAttention: O(N log N) Multiscale |
+---------------------------+---------------------------------+
|
+=============================================================+
| Core Layer |
| SpectralMemoryGraph | HyperdimensionalMemory |
| SpectralMethods | TopologicalAnalyzer | GraphRewriter |
+=============================================================+
|
+================================+
| SOFTWARE: Python / CPU |
| NumPy | SciPy | GUDHI |
+--------------------------------+
|
+===========================================+
| HARDWARE: SMGPU FPGA |
| Spectral | HD | Topo | Rewrite Engines |
| NoC | HBM | CAM | Memristor Crossbar |
+===========================================+
The Math (Or: Why This Actually Works)
Hyperdimensional Memory
The memory system comes from hyper-dimensional computing. I first encountered HDC through Pentti Kanerva’s work at Berkeley, and it’s one of those ideas that feels almost too elegant. Every node and edge gets a bipolar vector v ∈ {-1, +1}ᴰ where D=10,000.
The operations are simple:
· Bind (XOR for bipolar vectors): a ⊕ b forms an association, like a key-value pair.
· Unbind : c ⊕ a recovers b if c = a ⊕ b.
· Bundle (majority sum): ⊕ᵢ vᵢ = sign(Σᵢ vᵢ) superposes multiple vectors together.
· Similarity : sim(a,b) = (1/D)aᵀb – just normalized dot product.
What’s wild is that these 10,000-dimensional vectors are incredibly robust. You can search across millions of nodes with dot products, and associative recall drops to O(1) when you implement it in content-addressable memory. The math guarantees that random high-dimensional vectors are nearly orthogonal, so interference between stored patterns is minimal.
Spectral Graph Processing
This is where things get interesting. I take the knowledge graph’s adjacency matrix A and build the normalized graph Laplacian:
L = I – D⁻¹/² A D⁻¹/²
where D is the degree matrix. The eigenvectors uₖ of L give you a basis that captures the graph’s structure across different frequencies. Low-frequency eigenvectors encode broad community structure. High-frequency ones capture local details.
The Graph Fourier Transform of a signal x on the graph is:
x̂(λₖ) = ⟨x, uₖ⟩
A spectral convolution with filter gθ becomes:
gθ ★ x = U gθ(Λ) Uᵀ x
where U holds the eigenvectors and Λ is the diagonal eigenvalue matrix. Full eigendecomposition costs O(N³), so I approximate gθ using a Chebyshev polynomial expansion of order K. That gets you down to O(K|E|) – manageable even for large graphs.
Attention runs in this spectral domain. Input tokens map to nodes in a context graph, and the spectral filter learns which frequencies to attend to. Coarsening the graph into a multiscale hierarchy via graph wavelets gives O(N log N) without losing global context. I was honestly surprised how well this worked in practice – the “lost middle” problem that plagues standard transformers just doesn’t show up.
Topological Persistence for Not Forgetting Things
Continuous learning usually means catastrophic forgetting. I use persistent homology to monitor the graph’s topology as it evolves. You build a filtration of the graph (adding edges in similarity order) and compute the persistence diagram – points (bᵢ, dᵢ) tracking when topological features appear and disappear.
When new knowledge arrives, I impose a stability constraint: the Wasserstein distance between old and new persistence diagrams can’t exceed a threshold ε. If an update would distort the global topology too much, it gets rejected or deferred. Existing knowledge stays intact.
Pruning works the other direction. Low-persistence features tend to be noise or rarely-used facts, so removing them keeps the memory footprint bounded without touching anything important. This is genuine lifelong learning – the graph grows without collapsing.
Category-Theoretic Graph Rewriting for Reasoning
Reasoning uses double-pushout (DPO) graph rewriting. A rule r = (L ← K → R) describes how to transform a graph pattern L into R. When the planner searches for an answer, it matches L in the knowledge graph and applies the rewrite. The sequence of rewrites becomes a proof trace – every step is a graph homomorphism.
If a claim like “Socrates taught Plato” can be derived as a valid rewrite path, it’s verified. If no such path exists, the claim gets rejected. Every output has a constructive proof behind it or it doesn’t get generated. I’m not claiming this eliminates every possible error – garbage in, garbage out still applies – but it does eliminate the class of hallucination where the model just confidently invents things.
The Software Side
The Python package (pip install smgp) splits into layers:
| Module | What It Does |
| ------------------------- | ----------------------------------------------------------------------- |
| `core.graph` | `SpectralMemoryGraph` — the central knowledge graph with HD addressing |
| `core.hyperdim` | `HyperdimensionalMemory` — bind, unbind, bundle, similarity search |
| `core.spectral` | `SpectralMethods` — Laplacian, eigendecomposition, GFT, Chebyshev conv |
| `core.topology` | `TopologicalAnalyzer` — persistent homology, pruning |
| `core.category` | `GraphRewriter` — DPO-based rewriting |
| `memory` | `MemoryStore`, `AssociativeMemory`, `MemoryLifecycle`, pruning policies |
| `attention.spectral_attn` | `SpectralAttention` — O(N log N) multiscale attention |
| `reasoning` | `ClaimVerifier`, `NeuroSymbolicPlanner` with explainable variants |
| `integration` | HuggingFace, LangChain, FastAPI, ONNX/vLLM, vector DBs, federation |
| `utils` | Auto-tuning, multimodal embeddings, CLI, benchmarks |
Using it as an LLM replacement:
from smgp.integration.huggingface import SMGPForCausalLM
model = SMGPForCausalLM(graph_hd_dim=10000)
model.add_knowledge("Socrates", "person")
model.add_knowledge("Socrates", "Plato", "taught")
output = model.generate(prompt="Who did Socrates teach?", verify_claims=True)
SMGPForCausalLM subclasses PreTrainedModel and slots into any HuggingFace pipeline. The standard transformer feed-forward gets replaced with spectral attention over the knowledge graph, and generation calls the verifier to ground outputs. LangChain users get a persistent SMGPMemory and SMGPVerifierTool for agents that can fact-check themselves.
The Hardware Accelerator (SMGPU)
The spectral transforms, HD operations, and topology calculations add up. Running everything in pure Python on a CPU works for small graphs, but it’s not fast. I designed SMGPU as a domain-specific accelerator in synthesisable SystemVerilog, targeting the Xilinx Alveo U280 (8 GB HBM2, 250 MHz).
Microarchitecture
+---------------------------+
| PCIe / AXI Host |
+-------------+-------------+
|
+-------------v-------------+
| ISA Decoder |
| 32-bit Instruction |
| Fetch & Dispatch |
+------+------+------+-------+
| | |
+----------------------+ | +----------------------+
| | |
+--------v---------+ +--------v----------+ +--------v---------+
| Spectral Engine | | HD Engine | | Topology Engine |
| - Laplacian | | - Bundle: Maj | | - Union-Find |
| - Eigen-decomp | | - Bind/Unbind | | - Barcode Emit |
| - Chebyshev | | XOR | | - Wasserstein |
| - GFT / Wavelet | | - Permute: Shift | | - Stability |
| 16x16 Systolic | | - Similarity | | Check |
+--------+---------+ | 16 Parallel Banks| +--------+--------+
| +---------+----------+ |
| | |
+--------v---------+ +---------v----------+ +--------v---------+
| Rewrite Engine | | Associative Cache | | Memristor |
| - DPO Match |<-------->| CAM, 256 Entries |<-------->| Crossbar Array |
| - DPO Apply | | - O(1) Recall | | - 1024x1024 |
| - Proof Trace | | - 1024-dim HD Keys | | - Analog MVM |
+--------+---------+ +--------------------+ +-----------------+
|
+--------v----------+ +-------------+-------------+ +-------------------+
| Graph DMA |<--->| 2D Mesh NoC Router |<--->| HBM Controller |
| Scatter-Gather | | 5-port, XY Routing | | 4-ch, 256-bit |
+-------------------+ +---------------------------+ +---------+---------+
|
+---------v---------+
| HBM2 External |
+-------------------+
Compute Engines
| Engine | Architecture | Key Operations | Pipeline |
| -------- | ------------------------------- | -------------------------------------------------------------- | ------------ |
| Spectral | 16×16 systolic array | Laplacian, eigen-decomp, Chebyshev conv, GFT, wavelets | 4 stages |
| HD | 16 parallel banks | Bundle majority, bind/unbind XOR, permute, similarity popcount | 1–20 cycles |
| Topology | Union-Find with parallel prefix | Filtration, persistent homology, barcode, Wasserstein | 6 stages |
| Rewrite | Backtracking search FSM | DPO pattern match, rule apply, proof trace | 10-state FSM |
Memory Subsystem
| Component | Specification |
| ------------------ | --------------------------------------------------------------------- |
| Associative Cache | 256-entry CAM, 1024-dim projected keys, O(N) parallel lookup |
| HBM2 Controller | 4 channels, 256-bit data, burst-16, AXI4-Stream |
| Memristor Crossbar | Behavioral, 1024×1024 conductance cells, 8-bit resolution, analog MVM |
Interconnect
| Component | Specification |
| --------- | --------------------------------------------------------------------- |
| NoC | 2D mesh, 5-port routers, XY deterministic routing, 2 virtual channels |
| DMA | Scatter-gather, 256-entry descriptor queue, CSR-aware graph traversal |
Instruction Set
All instructions are 32-bit:
[31:28] opcode. – NOP, GRAPH_CTOR, SPECTRAL, HD, TOPOLOGY, REWRITE, MEMORY, SYSTEM
[27:24] sub_opcode. – sub-operation
[23:16] flags. – START, DONE_IRQ, CHAINED, STREAMING, BLOCKED, PRECISION_Q
[15:0] operand. – address, immediate, or register select
8 opcode classes with 3 – 7 sub-opcodes each. The Python HAL translates high-level graph operations into these instructions. If the FPGA isn’t available, everything falls back to pure Python automatically.
from smgp.core.graph import SpectralMemoryGraph
from hardware.sw.smgp_hal.executor import HWExecutor
executor = HWExecutor(device="/dev/smgpu0", fallback=True)
graph = SpectralMemoryGraph(hd_dim=10000, seed=42, executor=executor)
graph.add_edge("Socrates", "Plato", "taught")
from smgp.core.spectral import SpectralMethods
spectral = SpectralMethods(graph)
eigenvalues, _ = spectral.compute_eigen() # runs on FPGA at 250 MHz
Benchmarks
Software (Pure Python)
| Operation | Throughput (queries/s) | Latency (μs) | Memory (MB) |
| --------------------------------------------- | ---------------------: | -----------: | ----------: |
| Graph node creation | 12,000 | 83 | 0.8 |
| HD similarity search: 10K-dim, 10K candidates | 2,400 | 416 | 381 |
| Claim verification: 100-node graph | 8,100 | 123 | 1.2 |
| Spectral attention: 1K tokens, 128-dim | 850 | 1,176 | 45 |
AMD Ryzen 9 5950X, 32 GB RAM, single-threaded.
Hardware Acceleration (FPGA @ 250 MHz)
| Operation | Latency | Throughput | Speed-up vs. CPU |
| -------------------------- | ---------: | ------------------: | ---------------: |
| Laplacian: 4K nodes | ~82 μs | 49 M edges/s | 38× |
| Eigen-decomp: K=64 | ~5.2 ms | 12.3 K iterations/s | 15× |
| Chebyshev conv: K=4 | ~0.33 ms | 12.2 M nodes/s | 22× |
| HD bind/unbind: 10K-dim | 1 cycle | 250 M ops/s | 80× |
| HD similarity: 10K-dim | 313 cycles | 0.8 M queries/s | 12× |
| Topology barcode: 4K nodes | ~1.4 ms | 2.9 K graphs/s | 8× |
| Wasserstein distance | ~0.7 ms | 1.4 K comparisons/s | 11× |
| DPO match: pattern 4 nodes | ~32 μs | 31 K patterns/s | 24× |
CPU baseline: Intel Xeon Gold 6242, 32 cores. FPGA: Alveo U280, Vivado 2023.2, 250 MHz.
The HD bind/unbind hitting 80× speed-up makes sense – those are just XOR operations in parallel across 10,000-bit vectors. The topology numbers are lower because Union-Find has inherent serial dependencies that are harder to parallelize. I’m still looking at whether a different algorithm for persistence could close that gap.
FPGA Resource Utilization (Alveo U280, 16×16 systolic)
| Resource | Used | Available | Utilization |
| -------- | -----: | --------: | ----------: |
| LUTs | 95,328 | 1,043,000 | 9.1% |
| FFs | 78,112 | 2,086,000 | 3.7% |
| BRAM | 120 | 1,872 | 6.4% |
| DSPs | 384 | 9,024 | 4.3% |
| URAM | 48 | 960 | 5.0% |
Plenty of room on the chip. The design is parameterized – you can scale the systolic array from 8×8 up to 32×32 depending on what fits. There’s also an OpenLane ASIC flow targeting SkyWater 130 nm if you want to go beyond FPGAs.
Testing
| Layer | Tests | Result | Method |
| ------------------------ | --------------: | ------------ | ----------------------------------------- |
| Python core | 210+ | All passing | pytest, Hypothesis property-based testing |
| Python integrations | 20+ | All passing | pytest with mock services |
| RTL unit tests | 6 engines | 6/6 | Verilator 5.048, Cocotb |
| Golden model comparisons | 2: spectral, HD | Bit-accurate | Python vs. RTL fixed-point |
| CI regression benchmarks | 8 | Tracked | Airspeed Velocity: ASV |
The end-to-end RTL test loads a small knowledge graph, runs spectral attention, verifies a DPO rewrite, and checks the output against the software golden model. That’s been the most useful test for catching integration bugs between engines.
Running It in the Cloud
You don’t need your own FPGA. The Chameleon Cloud testbed (NSF-funded, free for research) has bare-metal Alveo U280 nodes.
Build the bitstream locally (requires Vivado 2023.2):
cd hardware/fpga/build
make synth BOARD=xcu280
make impl BOARD=xcu280
make xclbin. # produces smgp_u280.xclbin
Reserve a Chameleon node, upload the .xclbin, program the FPGA, and you’re running the same Python code with hardware acceleration. Full instructions are in the repo.
What’s Next
The HAL-to-core wiring is designed but needs merging. The PCIe driver skeleton exists but needs the register map filled in to match the RTL. I’m also looking at multi-FPGA scaling – partitioning the knowledge graph across several accelerators – and at formal verification of the ISA.
Power measurement is on the list too. I want to see how SMGPU compares to GPU inference in joules per query, not just latency. That might be the more interesting number.
Where This Came From
I got tired of patching the same problems over and over. Better prompting, RAG, fine-tuning – they all help around the edges but don’t change the fact that transformers have no persistent state and no built-in notion of truth. SMGP is an attempt to build something where memory, attention, and verification aren’t add-ons. They’re the foundation.
If you want to dig in, contribute, or talk about taping out an ASIC, the repo is at github.com/rotsl/smgp.
The thing I keep thinking about: we’ve spent a long time optimizing the transformer architecture within its inherent constraints. What if the constraints themselves are the problem?
This project draws on ideas from spectral graph theory (Chung, 1997), hyperdimensional computing (Kanerva, 1988; Plate, 1995), topological data analysis (Edelsbrunner et al., 2002), and algebraic graph transformation (Ehrig et al., 2006). The full references are in the repository.

Top comments (0)