Implementing Pollard's Kangaroo Algorithm on CUDA
Pollard's Kangaroo is an algorithm for solving the elliptic curve discrete logarithm problem (ECDLP) when the private key is known to lie in a specific range. It runs in O(sqrt(n)) time where n is the range size -- meaning a 2^64 range needs about 2^32 steps, not 2^64.
I implemented this on CUDA and got it running at about 1 billion steps per second on an RTX 3080 Ti. Along the way, I hit a subtle mathematical bug that took 10 debugging sessions to find, and it taught me more about the algorithm than any paper or textbook.
How the Algorithm Works
Imagine a number line from 0 to N (the search range). You're looking for a specific point T on this line -- the target private key.
You release two types of kangaroos:
Tame kangaroos start at a known position (the midpoint of the range). They hop forward by random amounts, recording their path. After enough hops, they've established a trail of "footprints" at known positions.
Wild kangaroos start at an unknown position (derived from the target public key). They use the same hopping rule. Since the hops are deterministic based on the current position, if a wild kangaroo ever lands on the same point as a tame kangaroo, they'll follow identical paths from that point forward.
When this collision happens, you can recover the key:
private_key = tame_distance - wild_distance
Here the tame distance is measured from zero -- it includes the kangaroo's known starting offset -- so it equals the tame kangaroo's absolute position on the line.
The expected number of steps before a collision is O(sqrt(n)).
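The number-line picture can be sketched in a few lines of Python. Everything here is a toy assumption for illustration -- small numbers and a two-entry jump table -- whereas a real run hops over curve points using ~32 jumps with mean near sqrt(N):

```python
# Toy kangaroo on a plain number line (illustrative parameters only).
secret = 12345            # the value we pretend not to know
lo, hi = 10000, 20000     # known range containing the secret

jumps = [1, 2]            # tiny jump table; real runs use ~32 jumps

def jump(pos):
    # Deterministic: the jump depends only on the current position.
    return jumps[pos % len(jumps)]

# Tame kangaroo: known start (top of the range), records footprints.
trail = {}
pos, dist = hi, 0
for _ in range(4000):
    trail[pos] = dist
    j = jump(pos)
    pos, dist = pos + j, dist + j

# Wild kangaroo: starts at the unknown secret, same hopping rule.
pos, dist = secret, 0
key = None
for _ in range(20000):
    if pos in trail:                  # landed on a tame footprint
        key = hi + trail[pos] - dist  # known start + tame dist - wild dist
        break
    j = jump(pos)
    pos, dist = pos + j, dist + j

print(key)  # -> 12345
```

On the curve, "position" is a point and "hop by j" is adding a precomputed point j*G, but the mechanics are identical.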
Why GPU?
Each kangaroo walk is independent -- different starting points, independent state. This maps perfectly to GPU threads: one kangaroo per thread, thousands running in parallel.
But each step is expensive: a full secp256k1 point addition (12 field multiplications × 64 multiply-accumulate operations each = ~768 32-bit operations per step). The GPU earns its keep through raw parallelism.
The Implementation
Jump Table
Each kangaroo's jump distance is determined by a hash of its current position. I use a table of 32 precomputed jump distances, selecting based on the low bits of the current x-coordinate:
jump_index = point.x & 0x1F // low 5 bits
jump_distance = jump_table[jump_index]
new_point = current_point + jump_table_points[jump_index]
total_distance += jump_distance
The jump table is stored in constant memory -- broadcast to all threads for free.
Kernel Structure
Two separate kernels run alternately:
- ComputeTameKangaroos: Each thread manages one tame kangaroo. The starting point is known (midpoint + offset). It hops forward, accumulating distance.
- ComputeWildKangaroos: Each thread manages one wild kangaroo. The starting point is the target public key plus a random offset. Same hopping rule.
Each kernel runs for a batch of steps (configurable), then control returns to the CPU for collision checking.
Distinguished Points
Checking every single point for collisions would require transferring enormous amounts of data from GPU to CPU. Instead, I use the distinguished point method: only points where hash(x) mod 2^d == 0 are reported.
This reduces the data transfer by a factor of 2^d, at the cost of slightly overshooting the actual collision point by an average of 2^d steps. For d=20, about 1 in a million points is distinguished.
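The criterion itself is one line. A sketch of the check and its rate (function and parameter names are mine, not the real kernel's):

```python
def is_distinguished(x, d=20):
    # A point is distinguished when the low d bits of (a hash of) x are zero.
    return x & ((1 << d) - 1) == 0

# Roughly 1 in 2^d candidates passes. Sweeping all 2^16 values with d=8:
hits = sum(is_distinguished(x, d=8) for x in range(1 << 16))
print(hits)  # -> 256, i.e. exactly 2^16 / 2^8
```

With d=20 the same uniform argument gives the "about 1 in a million" rate quoted above.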
256-bit Arithmetic
Each thread needs:
- 3 × 256-bit coordinates for the current point (Jacobian: X, Y, Z)
- 1 × 256-bit accumulator for total distance
- Temporary variables for field arithmetic
That's ~128 bytes (32 registers) of state per thread before temporaries. With up to 255 registers available per thread, each holding 4 bytes, there's room -- but it's tight. I use GRP_SIZE=4 (process 4 steps per kernel invocation per thread) to amortize kernel launch overhead without increasing register pressure.
The Negation Map Bug
Here's where the story gets interesting. Standard implementations of Pollard's Kangaroo use a "negation map" optimization: if the y-coordinate of the current point indicates it's in the "negative half" of the curve, negate both the point and the distance. This effectively doubles the chance of collisions because equivalent points are mapped to a canonical form.
I implemented this. It seemed to work on small test cases. Then on larger ranges, the algorithm would run for 10x the expected time with zero valid collisions.
10 Sessions of Debugging
Sessions 1-3: "Maybe the jump table is wrong." Verified every entry. Table was correct.
Sessions 4-5: "Maybe the GPU arithmetic has a bug." Compared GPU output against a Python reference implementation for every step. Arithmetic was correct.
Sessions 6-7: "Maybe the collision detection is broken." Added extensive logging. Collisions were being detected, but the recovered keys were wrong.
Sessions 8-9: "Maybe the distance accumulation has an overflow." Added overflow checks everywhere. No overflows.
Session 10: Found it.
The Root Cause
The negation map preserves an invariant for tame kangaroos:
distance * G = point // tame invariant
After negation: (N - distance) * G = -point. The invariant still holds because we negated both sides.
For wild kangaroos, the invariant is different:
target + distance * G = point // wild invariant
After negating the point, the invariant would require target + (N - distance) * G = -point. But since distance * G = point - target, we have (N - distance) * G = target - point, so target + (N - distance) * G = 2 * target - point ≠ -point unless the target is the identity. The invariant breaks.
The negation map assumes the relationship between distance and point is distance * G = point. For wild kangaroos, there's an additive constant (the target public key) that doesn't participate in the negation. After negating, the distance no longer correctly encodes the relationship to the target.
This means when you detect a collision and compute tame_dist - wild_dist, you get garbage.
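The break is visible with plain modular arithmetic. Model each point by its discrete log mod the group order N (so a point P = p*G becomes the integer p, and negating a point maps p to N - p); the numbers below are toy values I chose:

```python
N = 97                 # toy group order (assumption)
target = 13            # wild offset: discrete log of the target public key
d = 40                 # accumulated distance before the negation step

# Tame invariant: point = d * G, i.e. p == d (mod N).
p = d % N
p_neg, d_neg = (N - p) % N, (N - d) % N      # negate point AND distance
assert p_neg == d_neg % N                    # invariant survives

# Wild invariant: point = target + d * G, i.e. p == target + d (mod N).
p = (target + d) % N
p_neg, d_neg = (N - p) % N, (N - d) % N      # same naive update
print(p_neg, (target + d_neg) % N)  # -> 44 70: the invariant is broken
assert p_neg != (target + d_neg) % N         # off by 2 * target mod N
```

The mismatch is exactly 2 * target mod N for a single negation -- and since each walk negates a different, data-dependent number of times, the error is not a constant you could correct for afterward.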
The Fix
Remove the negation map from both tame and wild kangaroo kernels. The theoretical 2x speedup from the negation map is worth nothing if the algorithm produces wrong answers.
I have not found this specific bug documented anywhere. JLP's original Kangaroo implementation does not use a negation map in the walk step -- perhaps because the authors ran into the same issue and quietly avoided it.
Verification
After removing the negation map:
| Test | Range | Result | Time |
|---|---|---|---|
| 24-bit | 2^24 | PASS | 7.0s |
| 28-bit | 2^28 | PASS | 2.0s |
| 40-bit | 2^40 | PASS | 47.3s |
All known puzzle keys recovered correctly.
Production Concerns
Checkpoint/Resume
For large ranges (2^135), the algorithm runs for days or weeks. Binary checkpoint files save the state of all kangaroos (positions + distances) every 10 minutes. On restart, load the checkpoint and continue from where you left off.
Important lesson learned: after fixing the negation map bug, all existing checkpoints were invalid (they contained corrupted distances). Always version your checkpoint format and validate on load.
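A minimal sketch of the versioned-header idea (the magic bytes, field layout, and function names are invented for illustration, not the tool's actual format):

```python
import struct

MAGIC, VERSION = b"KANG", 2   # bump VERSION whenever the state layout changes

def save_checkpoint(path, payload: bytes):
    """Write magic + version header followed by the raw kangaroo state."""
    with open(path, "wb") as f:
        f.write(MAGIC + struct.pack("<I", VERSION) + payload)

def load_checkpoint(path) -> bytes:
    """Validate the header before trusting the state; refuse stale versions."""
    with open(path, "rb") as f:
        header, payload = f.read(8), f.read()
    if header[:4] != MAGIC:
        raise ValueError("not a checkpoint file")
    (version,) = struct.unpack("<I", header[4:8])
    if version != VERSION:
        # e.g. pre-bugfix checkpoints carry corrupted distances: reject them
        raise ValueError(f"checkpoint version {version} != {VERSION}")
    return payload
```

Rejecting an old version at load time turns a silent wrong-answer run into an immediate, explainable error.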
Progress Estimation
The expected number of steps is 2.08 * sqrt(range_size). I display progress as a multiple of this expected value:
Wild phase: 1,234,567,890 steps (0.6x expected) | 228 collisions
If you reach 4x expected without a collision, something is probably wrong.
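The estimator is a one-liner (2.08 is the constant quoted above for this two-kangaroo setup; function names are mine):

```python
import math

def expected_steps(range_bits):
    # ~2.08 * sqrt(range size) expected group operations
    return 2.08 * math.sqrt(2 ** range_bits)

def progress_multiple(steps_done, range_bits):
    # The "0.6x expected" figure shown in the status line
    return steps_done / expected_steps(range_bits)

# A 2^64 range is expected to take ~2.08 * 2^32 steps:
print(f"{expected_steps(64):.3e}")  # -> 8.934e+09
```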
Multi-GPU
The tame and wild phases can run on different GPUs. Tame kangaroos build the "trail" on one GPU, wild kangaroos search for collisions on another. The distinguished points are collected on CPU and checked against a shared hash table.
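A sketch of that shared table (the structure and names are my invention; the real tool's layout may differ). It is keyed by x-coordinate, and a tame/wild pair at the same x yields a candidate key:

```python
# CPU-side distinguished-point table: x-coordinate -> (kind, distance).
table = {}

def report_dp(x, kind, distance):
    """Record a distinguished point; on a tame/wild pair at the same x,
    return the candidate key (real code reduces this mod the curve order)."""
    if x in table and table[x][0] != kind:
        other_kind, other_dist = table[x]
        tame = distance if kind == "tame" else other_dist
        wild = other_dist if kind == "tame" else distance
        return tame - wild
    table[x] = (kind, distance)
    return None

print(report_dp(0xABCD, "tame", 500))  # -> None (first sighting)
print(report_dp(0xABCD, "wild", 120))  # -> 380 (candidate key)
```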
Performance
On RTX 3080 Ti:
- ~1 billion steps/second
- NB_RUN=128 kangaroos per GPU thread (higher values like 512 show zero collisions -- still investigating)
- Verified correct on 24-bit through 40-bit ranges
Takeaways
Mathematical invariants matter. The negation map bug is not a coding error -- it's a mathematical invariant violation that only affects one of two kangaroo types. No amount of code review would catch it without understanding the underlying algebra.
GPU algorithm implementation is different from CPU. You need to understand both the mathematics and the hardware constraints simultaneously. Register pressure, memory layout, and kernel launch overhead all influence algorithmic choices.
Production GPU code needs production infrastructure. Checkpointing, progress monitoring, multi-GPU coordination, and error logging are not optional for workloads that run for days.
When you're stuck, question your assumptions. The negation map is "obviously correct" according to every reference. It is correct -- for the standard formulation. My formulation had a different invariant, and the optimization didn't transfer.
I implement parallel algorithms on GPU for cryptographic and HPC workloads. If you need custom CUDA development, reach out on GitHub or LinkedIn.