How I optimized a Solana vanity address grinder to 44M keys/sec on GPU

Anton — Wed, 29 Apr 2026 19:07:50 +0000

I needed a suffix vanity address for a Solana token. Every tool I found either:

Claimed "GPU" but ran on CPU
Only did prefix search
Sent keys through a server

So I wrote one from scratch in CUDA. Here's the optimization journey.

Baseline: naive GPU port

Starting point: port TweetNaCl ed25519 to CUDA kernels.
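Result: 311k keys/sec, slower than the CPU baseline (455k keys/sec on 16 threads). The speedup factors in the step headings below are relative to that CPU figure.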

GPU parallelism is wasted if the per-thread work is slow. Time to fix the math.

Step 1: 8-bit signed-digit comb — 4.65M k/s (10x)

Standard scalar multiplication does 256 point doublings + 256 conditional adds.

The comb method splits the 256-bit scalar into 32 signed bytes, each indexing a precomputed table of j × 256^w × G for j ∈ [1..128], w ∈ [0..31] (4096 points, 480 KB — fits in L2 cache).

One scalar mult = 32 mixed additions instead of 512 operations.
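
The recoding behind the comb can be sketched in Python (used here for readability; the real kernel is CUDA). This is a minimal sketch, assuming a scalar below 2^255 (as with a clamped Ed25519 scalar) and signed digits in [-128, 128]. The names `recode_signed_bytes` and `comb_mul` are hypothetical, and plain integers stand in for curve points so the identity can be checked without a curve implementation.

```python
import random

def recode_signed_bytes(scalar: int) -> list[int]:
    """Split a scalar < 2^255 into 32 signed digits d_w such that
    sum(d_w * 256^w) == scalar and every d_w lies in [-128, 128]."""
    if not 0 <= scalar < 1 << 255:
        raise ValueError("scalar must be in [0, 2^255)")
    digits, carry = [], 0
    for w in range(32):
        d = ((scalar >> (8 * w)) & 0xFF) + carry
        # Fold large digits into the negative range and carry 1 upward.
        # The top digit never needs folding: its byte is <= 127, so d <= 128.
        carry = 1 if (w < 31 and d > 128) else 0
        d -= 256 * carry
        digits.append(d)
    return digits

def comb_mul(scalar: int, base: int) -> int:
    """Toy comb 'scalar multiplication' over the integers.

    table[w][j - 1] stands in for the precomputed point j * 256^w * G.
    A negative digit becomes a subtraction, which on the curve is an
    addition of the (cheaply) negated table entry.
    """
    table = [[j * 256**w * base for j in range(1, 129)] for w in range(32)]
    acc = 0
    for w, d in enumerate(recode_signed_bytes(scalar)):
        if d > 0:
            acc += table[w][d - 1]
        elif d < 0:
            acc -= table[w][-d - 1]
    return acc  # at most 32 table additions, no doublings

if __name__ == "__main__":
    k = random.getrandbits(255)
    assert comb_mul(k, 7) == k * 7
    print("first digits:", recode_signed_bytes(k)[:4])
```

Signed digits are what keep the table at 128 entries per window instead of 255: only the magnitude is looked up, and the sign selects add or subtract.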

Step 2: ref10 fe10 field arithmetic — 31M k/s (68x)

TweetNaCl uses 16×16-bit limbs. ref10 uses radix-2^25.5: 10 signed i32 limbs alternating 26/25 bits.

fe_mul drops from 256 partial products to 100, and uses mad.wide.s32 (1-cycle on Ampere) instead of slow i64 multiply.
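
To make the limb layout concrete, here is a small Python model of radix-2^25.5 multiplication. It is a sketch under simplifying assumptions, not ref10's code: limbs are kept non-negative, carry propagation is omitted (Python integers do not overflow), and `fe_from_int`, `fe_to_int` and `fe_mul` are illustrative names. On the GPU each partial product is a 32×32→64-bit multiply-add, which is the shape of operation `mad.wide.s32` provides.

```python
import random

P = 2**255 - 19
# Bit offset of limb i is ceil(25.5 * i), so limbs alternate 26 and 25 bits.
OFFSETS = [(51 * i + 1) // 2 for i in range(10)]

def fe_from_int(x: int) -> list[int]:
    """Pack an integer mod P into 10 limbs of 26/25/26/25/... bits."""
    x %= P
    limbs = []
    for i in range(10):
        width = 26 if i % 2 == 0 else 25
        limbs.append((x >> OFFSETS[i]) & ((1 << width) - 1))
    return limbs

def fe_to_int(h: list[int]) -> int:
    """Recombine limbs (possibly unreduced) into an integer mod P."""
    return sum(limb << off for limb, off in zip(h, OFFSETS)) % P

def fe_mul(f: list[int], g: list[int]) -> tuple[list[int], int]:
    """Schoolbook product; returns (limbs, number of partial products)."""
    h = [0] * 10
    products = 0
    for i in range(10):
        for j in range(10):
            term = f[i] * g[j]
            products += 1
            if i % 2 == 1 and j % 2 == 1:
                term *= 2    # two odd limbs: offsets sum to one bit too many
            if i + j >= 10:
                term *= 19   # wrap-around, since 2^255 = 19 (mod P)
            h[(i + j) % 10] += term
    return h, products

if __name__ == "__main__":
    a, b = random.getrandbits(255), random.getrandbits(255)
    h, n = fe_mul(fe_from_int(a), fe_from_int(b))
    assert fe_to_int(h) == a * b % P
    print("partial products:", n)
```

The two correction factors (2 and 19) are the price of the uneven limb widths; in exchange the product needs 10 × 10 limb multiplications instead of 16 × 16.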

Step 3: Montgomery batched inversion — 38M k/s (84x)

Each thread processes 4 candidates. Instead of 4 separate inversions:

Forward pass: compute prefix products of Z coordinates
One inv25519 on the final product
Backward pass: recover individual inverses

One inversion amortized across 4 candidates. Saves ~75% of inversion cost.
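
The three passes above can be sketched in Python as follows. This is a minimal model, assuming every Z is nonzero mod p and using Fermat exponentiation (`pow(x, P - 2, P)`) in place of `inv25519`; `batch_inverse` is a hypothetical name. For n values it costs one inversion plus 3(n - 1) multiplications, so 9 extra multiplications for a batch of 4.

```python
import random

P = 2**255 - 19

def batch_inverse(zs: list[int]) -> list[int]:
    """Invert every element of zs mod P using a single field inversion."""
    if not zs:
        return []
    # Forward pass: prefix[i] = zs[0] * zs[1] * ... * zs[i]
    prefix, acc = [], 1
    for z in zs:
        acc = acc * z % P
        prefix.append(acc)
    # One inversion of the total product.
    inv = pow(acc, P - 2, P)
    # Backward pass: peel off one factor at a time.
    out = [0] * len(zs)
    for i in range(len(zs) - 1, 0, -1):
        out[i] = inv * prefix[i - 1] % P   # 1 / zs[i]
        inv = inv * zs[i] % P              # now 1 / prefix[i - 1]
    out[0] = inv
    return out

if __name__ == "__main__":
    zs = [random.randrange(1, P) for _ in range(4)]
    for z, zi in zip(zs, batch_inverse(zs)):
        assert z * zi % P == 1
    print("4 inverses from 1 inversion: ok")
```

Larger batches amortize the inversion further, but each extra candidate held per thread needs more registers, which matters for a kernel that is already register-bound.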

Step 4: Suffix mod-58^K prefilter — 44M k/s (97x)

After pack25519, compute pubkey_int mod 58^K where K = suffix length.
Compare against precomputed target. 99.99% of non-matching keys skip the full base58 encode entirely.
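
A rough Python model of the prefilter, assuming the 32 public-key bytes are read as a big-endian integer (as base58 encoding does) and that the remainder is computed byte by byte so it fits in a machine word. The function names are illustrative, not taken from the tool. In this simplified model the remainder test is exact whenever the address has at least K characters, because the last K base58 characters are the base-58 digits of the key modulo 58^K.

```python
import random

ALPHABET = "123456789ABCDEFGHJKLMNPQRSTUVWXYZabcdefghijkmnopqrstuvwxyz"

def base58_encode(data: bytes) -> str:
    """Full (slow) base58 encoding, the step the prefilter tries to avoid."""
    n = int.from_bytes(data, "big")
    out = []
    while n:
        n, r = divmod(n, 58)
        out.append(ALPHABET[r])
    pad = len(data) - len(data.lstrip(b"\x00"))  # leading zero bytes become '1'
    return "1" * pad + "".join(reversed(out))

def suffix_target(suffix: str) -> int:
    """Suffix read as a base-58 number; computed once before the search."""
    t = 0
    for ch in suffix:
        t = t * 58 + ALPHABET.index(ch)  # raises ValueError on an invalid char
    return t

def mod_from_bytes(data: bytes, modulus: int) -> int:
    """Big-endian Horner reduction: no big integers needed.
    For K <= 5 the running value r * 256 + b stays far below 2^64."""
    r = 0
    for b in data:
        r = (r * 256 + b) % modulus
    return r

def passes_prefilter(pubkey: bytes, modulus: int, target: int) -> bool:
    return mod_from_bytes(pubkey, modulus) == target

if __name__ == "__main__":
    suffix = "pump"
    modulus, target = 58 ** len(suffix), suffix_target(suffix)
    # Build a key that is known to match, then confirm with the slow path.
    n = random.getrandbits(250)
    key = (n - n % modulus + target).to_bytes(32, "big")
    assert passes_prefilter(key, modulus, target)
    print(base58_encode(key))
```

Only keys that pass the integer comparison need the full encode, and then only once, to write the result out.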

Final numbers

Version	keys/sec
CPU 16 threads	455k
GPU naive (TweetNaCl)	311k
+ 8-bit comb	4.65M
+ fe10	31M
+ batched inversion	38M
+ mod-58^K prefilter	44M

RTX 3090, CUDA 13.1. Bottleneck: 255 registers/thread = Ampere max → 8 warps/SM occupancy.
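
The occupancy figure follows from simple register arithmetic, sketched below. The hardware limits are assumptions based on NVIDIA's published numbers for compute capability 8.6 (65,536 32-bit registers and at most 48 resident warps per SM, registers allocated per warp in units of 256); check them against the CUDA documentation for your GPU. `warps_limited_by_registers` is a hypothetical helper, and it ignores other limits such as shared memory and block size.

```python
REGS_PER_SM = 65536      # assumed: 32-bit registers per SM on sm_86
MAX_WARPS_PER_SM = 48    # assumed: resident warp limit on sm_86
WARP_SIZE = 32
ALLOC_UNIT = 256         # assumed: per-warp register allocation granularity

def warps_limited_by_registers(regs_per_thread: int) -> int:
    """Resident warps per SM when registers are the only constraint."""
    per_warp = regs_per_thread * WARP_SIZE
    per_warp = -(-per_warp // ALLOC_UNIT) * ALLOC_UNIT  # round up to the unit
    return min(MAX_WARPS_PER_SM, REGS_PER_SM // per_warp)

if __name__ == "__main__":
    warps = warps_limited_by_registers(255)
    print(f"{warps} warps/SM, {warps / MAX_WARPS_PER_SM:.1%} occupancy")
```

Under these assumptions, 255 registers per thread leaves room for 8 of the 48 possible warps, so any further gain has to come from cheaper per-thread math or lower register use rather than from more parallelism.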

Usage

vanity_gpu_sm86.exe pump        # runs forever → pump_results.csv
vanity_gpu_sm86.exe pump 100    # find 100 matches and stop

Pre-built binaries for RTX 20xx / 30xx / 40xx. Private keys never leave your machine.

→ https://github.com/alhimikix/solana-suffix-gpu

DEV Community: Anton