Description: "How we built a GPU simulator to verify a 6.7M gate NPU when Verilator failed"

canonical_url: https://wiowiz.com/vhe-virtual-hardware-emulator.html
The Problem
Our NPU design hit 1.4 million gates. Verilator started a convolution test.
Runtime: 139 billion cycles
VCD trace: 56 GB
Status: Killed after 3 days
Commercial emulators cost a lot. We're a startup in India. That wasn't happening.
What We Built
VHE (Virtual Hardware Emulator) — GPU-accelerated gate-level simulation.
┌─────────────┐    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│    Yosys    │───▶│   Parser    │───▶│  Levelizer  │───▶│    CUDA     │
│  JSON Net   │    │  (Python)   │    │  (DAG sort) │    │   Kernel    │
└─────────────┘    └─────────────┘    └─────────────┘    └─────────────┘
                                                                │
                                                                ▼
                                                         ┌─────────────┐
                                                         │ Simulation  │
                                                         │   Output    │
                                                         └─────────────┘
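The parser stage can be sketched as below. This is a minimal illustration and not VHE's actual parser: it assumes the netlist comes from Yosys's `write_json` command and reads only the `modules`, `cells`, `type`, `port_directions`, and `connections` fields of that format. The `Gate` record and the `parse_netlist` function are hypothetical names, and cells without `port_directions` (for example, black boxes) are not handled.

```python
import json
from typing import NamedTuple

class Gate(NamedTuple):
    name: str
    cell_type: str
    inputs: tuple    # net ids (ints) or constant bits ("0", "1", "x", "z")
    outputs: tuple

def parse_netlist(text: str, top: str) -> list:
    """Flatten one module of a Yosys JSON netlist into a list of Gate records."""
    module = json.loads(text)["modules"][top]
    gates = []
    for name, cell in module["cells"].items():
        ins, outs = [], []
        for port, bits in cell["connections"].items():
            # Simplification: any port that is not an input is treated as driven.
            direction = cell["port_directions"][port]
            (ins if direction == "input" else outs).extend(bits)
        gates.append(Gate(name, cell["type"], tuple(ins), tuple(outs)))
    return gates

# Hand-written two-gate netlist in the write_json layout: y = NOT(a AND b)
EXAMPLE = """
{"modules": {"top": {
  "ports": {"a": {"direction": "input", "bits": [2]},
            "b": {"direction": "input", "bits": [3]},
            "y": {"direction": "output", "bits": [5]}},
  "cells": {
    "g0": {"type": "$_AND_",
           "port_directions": {"A": "input", "B": "input", "Y": "output"},
           "connections": {"A": [2], "B": [3], "Y": [4]}},
    "g1": {"type": "$_NOT_",
           "port_directions": {"A": "input", "Y": "output"},
           "connections": {"A": [4], "Y": [5]}}
  }}}}
"""

gates = parse_netlist(EXAMPLE, "top")
```

Each `Gate` carries the net ids it reads and writes, which is the information the levelizer needs to build the dependency graph.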
The Journey (Real Numbers)
| Design | Gates | VHE Speed | vs Verilator |
|---|---|---|---|
| PicoRV32 | 8K | 11,063 cyc/s | 100× faster |
| mor1kx | 1.25M | 2,941 cyc/s | 27× faster |
| GEMMX | 1.4M | 1,465 cyc/s | 13× faster |
| WZ-NPU | 6.7M | 3,444 cyc/s | Verilator: DNF |
Architecture Deep Dive
Phase 1: Levelization
Gates form a DAG. We topologically sort them into "levels" — gates at level N depend only on gates at levels < N.
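changed = True
while changed:  # repeat until no level changes (fixed point)
    changed = False
    for gate in gates:  # assumes every gate.level starts at 0
        level = max((src.level for src in gate.inputs), default=-1) + 1
        changed, gate.level = changed or level != gate.level, level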
Our 6.7M gate NPU: 447 logic levels.
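A single-pass alternative to the fixed-point loop is Kahn's topological sort, which visits each gate exactly once and so needs no iteration cap. The sketch below is one way to do it and is not necessarily what VHE uses. It assumes a hypothetical `fanin` mapping from each gate name to the names of the gates driving it, with every driver present as a key.

```python
from collections import deque

def levelize(fanin: dict) -> dict:
    """Return {gate: level}, where level = 1 + max level of its drivers (0 if none)."""
    fanout = {g: [] for g in fanin}
    pending = {}                          # drivers not yet levelized, per gate
    for gate, drivers in fanin.items():
        pending[gate] = len(drivers)
        for d in drivers:
            fanout[d].append(gate)

    level = {g: 0 for g in fanin}
    ready = deque(g for g, n in pending.items() if n == 0)
    done = 0
    while ready:
        gate = ready.popleft()
        done += 1
        for succ in fanout[gate]:
            level[succ] = max(level[succ], level[gate] + 1)
            pending[succ] -= 1
            if pending[succ] == 0:        # all drivers levelized: succ is final
                ready.append(succ)
    if done != len(fanin):
        raise ValueError("combinational loop: netlist is not a DAG")
    return level

levels = levelize({
    "a": [], "b": [],
    "and0": ["a", "b"],
    "not0": ["and0"],
    "or0": ["not0", "a"],
})
```

In this example `or0` lands on level 3 because its deepest driver, `not0`, is on level 2. A gate left unvisited at the end indicates a combinational loop, which this sketch reports instead of looping forever.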
Phase 2: GPU Dispatch
Each level is a CUDA kernel launch. Gates within a level evaluate in parallel.
Level 0: 12,847 gates → 1 kernel, 12,847 threads
Level 1: 8,234 gates → 1 kernel, 8,234 threads
...
Level 447: 156 gates → 1 kernel, 156 threads
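The dispatch scheme can be modelled on the CPU. The sketch below is a plain-Python reference model, not the CUDA kernel: each pass of the outer loop stands in for one kernel launch, and each gate in a level stands in for one GPU thread. The `(op, inputs, output)` tuple layout and the `OPS` table are assumptions made for illustration.

```python
# Two-valued gate functions (0/1 only; no X or Z)
OPS = {
    "AND": lambda a, b: a & b,
    "OR":  lambda a, b: a | b,
    "XOR": lambda a, b: a ^ b,
    "NOT": lambda a: a ^ 1,
}

def evaluate(levels, nets):
    """Evaluate a levelized netlist in place; `nets` maps net name -> 0/1."""
    for gates in levels:                  # one "kernel launch" per level
        # Gates in a level only read nets written by earlier levels, so their
        # order does not matter and a GPU can evaluate them in parallel.
        results = [(out, OPS[op](*(nets[i] for i in ins))) for op, ins, out in gates]
        nets.update(results)              # commit once the whole level is done
    return nets

# Full adder: sum = a ^ b ^ cin, cout = (a & b) | (cin & (a ^ b))
FULL_ADDER = [
    [("XOR", ("a", "b"), "t0"), ("AND", ("a", "b"), "t1")],
    [("XOR", ("t0", "cin"), "sum"), ("AND", ("t0", "cin"), "t2")],
    [("OR", ("t1", "t2"), "cout")],
]

out = evaluate(FULL_ADDER, {"a": 1, "b": 1, "cin": 0})
```

The level boundary acts as the synchronization point: no gate in level N starts until every result from level N-1 has been written.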
Challenges We Hit
Levelization cap: The initial algorithm hit its 100-iteration limit. We fixed it with proper visited tracking.
Memory management: 6.7M gates × 4 bytes × 2 (current + next) = 54 MB state. Fits in GPU memory.
Timing accuracy: VHE is currently zero-delay (functional simulation only). SDF timing is the next project phase (in progress).
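The state-size estimate is easy to reproduce. The helper below is hypothetical and only restates the arithmetic from the text (one 4-byte word per gate, double-buffered), using decimal megabytes.

```python
def state_bytes(gates: int, bytes_per_gate: int = 4, buffers: int = 2) -> int:
    """Bytes needed to hold one value per gate in each buffer."""
    return gates * bytes_per_gate * buffers

size_mb = state_bytes(6_700_000) / 1e6    # decimal megabytes
print(f"{size_mb:.1f} MB")                # prints "53.6 MB", i.e. about 54 MB
```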
Why Not Verilator?
Verilator is great for RTL. But at gate-level with millions of cells:
- VCD traces explode (56 GB for one test)
- Single-threaded evaluation by default (multithreading via `--threads` is opt-in)
- No GPU acceleration
VHE trades generality for speed. We only support gate-level netlists from Yosys. That's all we need.
The Proof
We used VHE to verify WZ-NPU (our open-source NPU):
| Test | Description | Result |
|---|---|---|
| VHE-F1 | Deterministic GEMM | ✅ PASS |
| VHE-F2 | Random GEMM (10 seeds) | ✅ PASS |
| VHE-F3 | Reset/Start torture | ✅ PASS |
| VHE-P1 | Tile scaling equivalence | ✅ PASS |
| VHE-S1 | Backpressure stress | ✅ PASS |
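A test like VHE-F2 needs a reference result to compare the simulated hardware against. The sketch below shows one way to structure such a check. The matrix shape, the value range, and the `run_dut` callback are hypothetical stand-ins, since the actual WZ-NPU testbench is not shown here.

```python
import random

def reference_gemm(a, b):
    """Plain-Python integer matrix multiply used as the golden model."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def random_matrix(rng, rows, cols, lo=-128, hi=127):
    """Matrix of random integers in [lo, hi] (int8 range assumed for illustration)."""
    return [[rng.randint(lo, hi) for _ in range(cols)] for _ in range(rows)]

def check_random_gemm(run_dut, seeds=range(10), m=4, k=4, n=4):
    """Return the seeds whose DUT output differs from the golden model."""
    failures = []
    for seed in seeds:
        rng = random.Random(seed)         # per-seed generator keeps runs reproducible
        a, b = random_matrix(rng, m, k), random_matrix(rng, k, n)
        if run_dut(a, b) != reference_gemm(a, b):
            failures.append(seed)
    return failures

# Stand-in for the simulated hardware: here the "DUT" is the model itself.
failed = check_random_gemm(reference_gemm)
```

Seeding a fresh generator per test case means any failing seed can be replayed on its own without rerunning the others.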
What's Next
- SDF timing annotation (post-layout accuracy)
- 4-value logic (X, Z propagation)
- Waveform export (VCD/FST)
- Integration with formal tools
Links
- WZ-NPU GitLab: https://gitlab.com/wiowiztechnologies/wz-npu
- WIOWIZ: https://wiowiz.com