I needed a suffix vanity address for a Solana token. Every tool I found either:
- Claimed "GPU" but ran on CPU
- Only did prefix search
- Sent keys through a server
So I wrote one from scratch in CUDA. Here's the optimization journey.
Baseline: naive GPU port
Starting point: port TweetNaCl ed25519 to CUDA kernels.
Result: 311k keys/sec — slower than CPU (455k on 16 threads).
GPU parallelism is wasted if the per-thread work is slow. Time to fix the math.
Step 1: 8-bit signed-digit comb (4.65M keys/sec, 10x the CPU baseline)
Standard scalar multiplication does 256 point doublings + 256 conditional adds.
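The comb method splits the 256-bit scalar into 32 signed byte digits. Digit w indexes a precomputed table of j × 256^w × G for j ∈ [1..128], w ∈ [0..31]; a negative digit uses the same entry with the point negated, so only magnitudes need to be stored. The table holds 4096 points (480 KB), which fits in L2 cache.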
One scalar mult = 32 mixed additions instead of 512 operations.
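To make the recoding and table lookup concrete, here is a small Python reference model. It is a sketch under assumptions, not the kernel code: it uses affine big-integer arithmetic instead of mixed additions in extended coordinates, and the function names (`recode_signed_bytes`, `build_table`, `comb_mul`) are invented for illustration. The digit range chosen here is [-127, 128], which matches the table magnitudes 1..128 described above.

```python
# Reference model of an 8-bit signed-digit comb on edwards25519.
# Affine big-int arithmetic for clarity; NOT the CUDA kernel's representation.
P = 2**255 - 19
D = -121665 * pow(121666, -1, P) % P
IDENTITY = (0, 1)

def add(a, b):
    """Complete affine twisted Edwards addition (curve coefficient a = -1)."""
    (x1, y1), (x2, y2) = a, b
    t = D * x1 * x2 * y1 * y2 % P
    x3 = (x1 * y2 + x2 * y1) * pow((1 + t) % P, -1, P) % P
    y3 = (y1 * y2 + x1 * x2) * pow((1 - t) % P, -1, P) % P
    return (x3, y3)

def base_point():
    """Recover the base point from y = 4/5, taking the even x root."""
    y = 4 * pow(5, -1, P) % P
    x2 = (y * y - 1) * pow((D * y * y + 1) % P, -1, P) % P
    x = pow(x2, (P + 3) // 8, P)
    if (x * x - x2) % P != 0:
        x = x * pow(2, (P - 1) // 4, P) % P   # multiply by sqrt(-1)
    if x & 1:
        x = P - x
    return (x, y)

def recode_signed_bytes(k):
    """Split k < 2**255 into 32 digits in [-127, 128], sum(d * 256**w) == k."""
    digits, carry = [], 0
    for w in range(32):
        d = ((k >> (8 * w)) & 0xFF) + carry
        carry = 0
        if d > 128:            # borrow from the next byte to keep |d| <= 128
            d -= 256
            carry = 1
        digits.append(d)
    assert carry == 0, "scalar must be below 2**255"
    return digits

def build_table(g):
    """table[w][j-1] = j * 256**w * g for j in 1..128, w in 0..31."""
    table, base = [], g
    for _ in range(32):
        row = [base]
        for _ in range(127):
            row.append(add(row[-1], base))
        table.append(row)
        base = add(row[-1], row[-1])   # 2 * (128 * base) = 256 * base
    return table

def comb_mul(k, table):
    """k * G with one table lookup and at most one addition per byte."""
    acc = IDENTITY
    for w, d in enumerate(recode_signed_bytes(k)):
        if d > 0:
            acc = add(acc, table[w][d - 1])
        elif d < 0:
            x, y = table[w][-d - 1]
            acc = add(acc, ((-x) % P, y))   # negating a point flips x
    return acc

def double_and_add(k, g):
    """Bit-by-bit reference, used only to check the comb result."""
    acc, q = IDENTITY, g
    while k:
        if k & 1:
            acc = add(acc, q)
        q = add(q, q)
        k >>= 1
    return acc
```

Calling `comb_mul(k, build_table(base_point()))` should agree with `double_and_add(k, base_point())` for any scalar below 2^255. The table is built once and reused for every candidate, which is where the savings come from.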
Step 2: ref10 fe10 field arithmetic (31M keys/sec, 68x the CPU baseline)
TweetNaCl uses 16×16-bit limbs. ref10 uses radix-2^25.5: 10 signed i32 limbs alternating 26/25 bits.
fe_mul drops from 256 partial products to 100, and uses mad.wide.s32 (1-cycle on Ampere) instead of slow i64 multiply.
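The limb layout is easier to see in a few lines of Python. This is an illustrative model only: it uses non-negative limbs and Python integers rather than signed i32 limbs, and it skips the carry chain that a real fe_mul runs afterwards. `to_limbs` and `from_limbs` are hypothetical helper names.

```python
# Radix-2^25.5 sketch: limb i carries weight 2**ceil(25.5 * i).
P = 2**255 - 19
OFFSETS = [0, 26, 51, 77, 102, 128, 153, 179, 204, 230]

def to_limbs(x):
    """Split x < 2**255 into 10 limbs of alternating 26 and 25 bits."""
    limbs = []
    for i in range(10):
        bits = 26 if i % 2 == 0 else 25
        limbs.append(x & ((1 << bits) - 1))
        x >>= bits
    return limbs

def from_limbs(h):
    """Recombine limbs (possibly unreduced) into an integer mod p."""
    return sum(v << OFFSETS[i] for i, v in enumerate(h)) % P

def fe_mul(f, g):
    """Schoolbook product: 10 x 10 = 100 partial products folded into 10 limbs."""
    h = [0] * 10
    products = 0
    for i in range(10):
        for j in range(10):
            term = f[i] * g[j]
            if i % 2 == 1 and j % 2 == 1:
                term *= 2     # two odd limbs land one bit above the target weight
            if i + j >= 10:
                term *= 19    # 2**255 = 19 (mod p) wraps high terms around
            h[(i + j) % 10] += term
            products += 1
    return h, products
```

The two correction factors (×2 for odd/odd pairs, ×19 for wrapped terms) are the only extra work compared with a plain polynomial product. In a fixed-width implementation they are typically folded into precomputed copies of the inputs so the inner loop stays a pure multiply-accumulate.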
Step 3: Montgomery batched inversion (38M keys/sec, 84x the CPU baseline)
Each thread processes 4 candidates. Instead of 4 separate inversions:
- Forward pass: compute prefix products of Z coordinates
- One inv25519 on the final product
- Backward pass: recover individual inverses
One inversion amortized across 4 candidates. Saves ~75% of inversion cost.
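The three passes above can be sketched in Python as follows. This is a generic version of Montgomery's trick over the field mod 2^255 - 19, with `batch_invert` as a hypothetical name; the single `pow(..., P - 2, P)` call stands in for inv25519, and all inputs are assumed to be nonzero.

```python
P = 2**255 - 19

def batch_invert(zs):
    """Invert every element of zs mod P using a single field inversion."""
    n = len(zs)
    # Forward pass: prefix[i] = zs[0] * ... * zs[i]
    prefix, acc = [], 1
    for z in zs:
        acc = acc * z % P
        prefix.append(acc)
    # One inversion of the full product (Fermat: x^(p-2))
    inv = pow(acc, P - 2, P)
    # Backward pass: peel off one factor at a time
    out = [0] * n
    for i in range(n - 1, 0, -1):
        out[i] = inv * prefix[i - 1] % P   # (z0..zi)^-1 * (z0..z(i-1)) = zi^-1
        inv = inv * zs[i] % P              # now the inverse of z0..z(i-1)
    out[0] = inv
    return out
```

For a batch of n values this costs one inversion plus roughly 3(n - 1) multiplications, so the benefit grows with batch size until register pressure or memory traffic becomes the limit.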
Step 4: Suffix mod-58^K prefilter (44M keys/sec, 97x the CPU baseline)
After pack25519, compute pubkey_int mod 58^K where K = suffix length.
Compare against precomputed target. 99.99% of non-matching keys skip the full base58 encode entirely.
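A short Python sketch shows why the check works: the last K base58 characters depend only on the key's integer value mod 58^K. The names here (`suffix_target`, `mod_small`, `passes_prefilter`) are made up for illustration, and the sketch assumes the 32 key bytes are read as a big-endian integer, as standard base58 encoding does.

```python
ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"

def suffix_target(suffix):
    """Value of the suffix read as a base58 number (last char least significant)."""
    t = 0
    for ch in suffix:
        t = t * 58 + ALPHABET.index(ch)
    return t

def mod_small(key_bytes, m):
    """Big-endian bytes mod m via Horner's rule.

    The running value stays below m * 256, so it fits in 64 bits
    while m < 2**56 (suffixes of up to 9 characters).
    """
    r = 0
    for b in key_bytes:
        r = (r * 256 + b) % m
    return r

def passes_prefilter(key_bytes, suffix):
    """True when the key's base58 encoding would end with suffix."""
    return mod_small(key_bytes, 58 ** len(suffix)) == suffix_target(suffix)

def b58encode(data):
    """Plain base58 encode, used here only to cross-check the prefilter."""
    n = int.from_bytes(data, "big")
    out = ""
    while n:
        n, r = divmod(n, 58)
        out = ALPHABET[r] + out
    pad = len(data) - len(data.lstrip(b"\x00"))
    return "1" * pad + out
```

The target and modulus are computed once per search, so the per-key cost is 32 small multiply-add-reduce steps instead of a full multi-precision division loop.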
Final numbers
| Version | keys/sec |
|---|---|
| CPU 16 threads | 455k |
| GPU naive (TweetNaCl) | 311k |
| + 8-bit comb | 4.65M |
| + fe10 | 31M |
| + batched inversion | 38M |
| + mod-58^K prefilter | 44M |
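Measured on an RTX 3090 with CUDA 13.1. The bottleneck is register pressure: the kernel uses 255 registers per thread, the Ampere maximum. With 65,536 registers per SM, that leaves room for only 256 resident threads, i.e. 8 warps per SM.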
Usage
vanity_gpu_sm86.exe pump # runs forever → pump_results.csv
vanity_gpu_sm86.exe pump 100 # find 100 matches and stop
Pre-built binaries for RTX 20xx / 30xx / 40xx. Private keys never leave your machine.