WIOWIZ Technologies

VHE: GPU-Accelerated Gate-Level Simulation at Zero License Cost

Description: "How we built a GPU simulator to verify a 6.7M gate NPU when Verilator failed"

canonical_url: https://wiowiz.com/vhe-virtual-hardware-emulator.html

The Problem

Our NPU design hit 1.4 million gates. Verilator started a convolution test.

Runtime: 139 billion cycles
VCD trace: 56 GB
Status: Killed after 3 days

Commercial emulators cost a lot. We're a startup in India. That wasn't happening.

What We Built

VHE (Virtual Hardware Emulator) — GPU-accelerated gate-level simulation.

┌─────────────┐    ┌─────────────┐    ┌─────────────┐    ┌─────────────┐
│   Yosys     │───▶│   Parser    │───▶│  Levelizer  │───▶│    CUDA     │
│  JSON Net   │    │  (Python)   │    │  (DAG sort) │    │   Kernel    │
└─────────────┘    └─────────────┘    └─────────────┘    └─────────────┘
                                                               │
                                                               ▼
                                                        ┌─────────────┐
                                                        │  Simulation │
                                                        │   Output    │
                                                        └─────────────┘
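To make the first two boxes concrete, here is a minimal sketch of reading a Yosys JSON netlist in Python. It assumes the layout written by Yosys `write_json` (modules, then cells, each with `type`, `port_directions`, and `connections`). The two-gate netlist and the `parse_gates` helper are hypothetical illustrations, not VHE's actual parser.

```python
import json

# Hypothetical two-gate netlist in the layout produced by Yosys `write_json`.
NETLIST_JSON = """
{
  "modules": {
    "top": {
      "ports": {
        "a": {"direction": "input", "bits": [2]},
        "b": {"direction": "input", "bits": [3]},
        "y": {"direction": "output", "bits": [5]}
      },
      "cells": {
        "g0": {
          "type": "$_AND_",
          "port_directions": {"A": "input", "B": "input", "Y": "output"},
          "connections": {"A": [2], "B": [3], "Y": [4]}
        },
        "g1": {
          "type": "$_NOT_",
          "port_directions": {"A": "input", "Y": "output"},
          "connections": {"A": [4], "Y": [5]}
        }
      }
    }
  }
}
"""

def parse_gates(netlist, module):
    """Return one (cell_type, input_nets, output_net) tuple per cell (hypothetical helper)."""
    gates = []
    for cell in netlist["modules"][module]["cells"].values():
        input_nets, output_net = [], None
        for port, bits in cell["connections"].items():
            if cell["port_directions"][port] == "input":
                input_nets.extend(bits)
            else:
                output_net = bits[0]  # assumes single-bit, single-output gate cells
        gates.append((cell["type"], input_nets, output_net))
    return gates

gates = parse_gates(json.loads(NETLIST_JSON), "top")
```

Each net is an integer ID, so the driver of a gate input can be found by matching input net IDs against output net IDs. That mapping is what the levelizer needs.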

The Journey (Real Numbers)

Design     Gates    VHE Speed       vs Verilator
PicoRV32   8K       11,063 cyc/s    100× faster
mor1kx     1.25M    2,941 cyc/s     27× faster
GEMMX      1.4M     1,465 cyc/s     13× faster
WZ-NPU     6.7M     3,444 cyc/s     Verilator: DNF

Architecture Deep Dive

Phase 1: Levelization

Gates form a DAG. We topologically sort them into "levels" — gates at level N depend only on gates at levels < N.

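changed = True
while changed:  # sweep until no level moves; assumes every gate.level starts at 0
    changed = False
    for gate in gates:
        level = max((src.level for src in gate.inputs), default=-1) + 1
        gate.level, changed = level, changed or level != gate.level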

Our 6.7M gate NPU: 447 logic levels.
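A sweep-until-stable loop can need roughly one pass per logic level, which adds up at 447 levels. One common single-pass alternative is Kahn's algorithm, sketched below. This is an illustration under assumptions, not necessarily what VHE does: the `fanin` dictionary (gate name to driver gate names, primary inputs omitted) and the `levelize` function are hypothetical.

```python
from collections import deque

def levelize(fanin):
    """Assign each gate a level = 1 + max level of its drivers, in one pass.

    fanin maps gate name -> list of driver gate names (hypothetical layout).
    """
    fanout = {g: [] for g in fanin}
    for g, drivers in fanin.items():
        for d in drivers:
            fanout[d].append(g)
    pending = {g: len(drivers) for g, drivers in fanin.items()}
    level = {g: 0 for g in fanin}
    ready = deque(g for g in fanin if pending[g] == 0)
    done = 0
    while ready:
        g = ready.popleft()
        done += 1
        for succ in fanout[g]:
            # A successor sits one level above its deepest driver.
            level[succ] = max(level[succ], level[g] + 1)
            pending[succ] -= 1
            if pending[succ] == 0:
                ready.append(succ)
    if done != len(fanin):
        # Some gate never became ready, so the graph has a cycle.
        raise ValueError("combinational loop: netlist is not a DAG")
    return level

example = {"g0": [], "g1": [], "g2": ["g0", "g1"], "g3": ["g2", "g0"]}
levels = levelize(example)
```

Each gate and each connection is visited once, and a combinational loop is reported as an error instead of silently running into an iteration cap.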

Phase 2: GPU Dispatch

Each level is a CUDA kernel launch. Gates within a level evaluate in parallel.

Level 0: 12,847 gates  → 1 kernel, 12,847 threads
Level 1: 8,234 gates   → 1 kernel, 8,234 threads
...
Level 447: 156 gates   → 1 kernel, 156 threads
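The dispatch order can be sketched on the CPU. The Python below is a stand-in for the CUDA kernels, not the real thing: each loop iteration over a level plays the role of one kernel launch, and each gate in the batch plays the role of one thread. The `OPS` table, the data layout, and `simulate` are assumptions made for illustration.

```python
from collections import defaultdict

# Two-value gate functions (hypothetical subset of a cell library).
OPS = {
    "AND": lambda a, b: a & b,
    "OR": lambda a, b: a | b,
    "XOR": lambda a, b: a ^ b,
    "NOT": lambda a: a ^ 1,
}

def simulate(gates, levels, values):
    """Evaluate one zero-delay pass, level by level.

    gates:  name -> (op, input_nets, output_net)
    levels: name -> level from levelization
    values: net -> 0/1, pre-filled with primary inputs, updated in place
    """
    batches = defaultdict(list)
    for name, lvl in levels.items():
        batches[lvl].append(name)
    for lvl in sorted(batches):
        # One "kernel launch": gates in a batch read only nets written by
        # earlier levels, so they could all run in parallel.
        results = []
        for name in batches[lvl]:
            op, in_nets, out_net = gates[name]
            results.append((out_net, OPS[op](*(values[n] for n in in_nets))))
        for out_net, bit in results:
            values[out_net] = bit
    return values

gates = {
    "x": ("XOR", ["a", "b"], "sum"),
    "c": ("AND", ["a", "b"], "carry"),
    "n": ("NOT", ["carry"], "ncarry"),
}
levels = {"x": 0, "c": 0, "n": 1}
out = simulate(gates, levels, {"a": 1, "b": 1})
```

Writes are applied only after the whole batch has been evaluated, mirroring the fact that threads in one launch must not depend on each other's results.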

Challenges We Hit

  1. Levelization cap: The initial algorithm hit its 100-iteration limit. Fixed with proper visited tracking.

  2. Memory management: 6.7M gates × 4 bytes × 2 (current + next) ≈ 54 MB of state. Fits in GPU memory.

  3. Timing accuracy: The first project phase is zero-delay (functional only). The second project phase adds SDF timing (in progress).
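The memory figure in item 2 is easy to check. The snippet below just redoes the arithmetic from the text. The constant names are ours, and 4 bytes per gate is the figure stated above, not something measured here.

```python
GATES = 6_700_000          # WZ-NPU gate count from the article
BYTES_PER_GATE = 4         # state per gate, as stated in the article
BUFFERS = 2                # current + next

state_bytes = GATES * BYTES_PER_GATE * BUFFERS
state_mb = state_bytes / 1e6       # decimal megabytes
state_mib = state_bytes / 2**20    # binary mebibytes

print(f"{state_bytes:,} bytes = {state_mb:.1f} MB = {state_mib:.1f} MiB")
```

That comes to 53.6 MB, which rounds to the 54 MB quoted above.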

Why Not Verilator?

Verilator is great for RTL. But at gate-level with millions of cells:

  • VCD traces explode (56 GB for one test)
  • Single-threaded evaluation by default (multithreading needs an explicit --threads build)
  • No GPU acceleration

VHE trades generality for speed. We only support gate-level netlists from Yosys. That's all we need.

The Proof

We used VHE to verify WZ-NPU (our open-source NPU):

Test      Description                   Result
VHE-F1    Deterministic GEMM            ✅ PASS
VHE-F2    Random GEMM (10 seeds)        ✅ PASS
VHE-F3    Reset/Start torture           ✅ PASS
VHE-P1    Tile scaling equivalence      ✅ PASS
VHE-S1    Backpressure stress           ✅ PASS

What's Next

  • SDF timing annotation (post-layout accuracy)
  • 4-value logic (X, Z propagation)
  • Waveform export (VCD/FST)
  • Integration with formal tools

Links

