<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Raphael Bernardo</title>
    <description>The latest articles on DEV Community by Raphael Bernardo (@raphaelbgr).</description>
    <link>https://dev.to/raphaelbgr</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F606590%2Fda2ed3c7-ecb6-4b47-902f-b4d13e04fb4d.jpeg</url>
      <title>DEV Community: Raphael Bernardo</title>
      <link>https://dev.to/raphaelbgr</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/raphaelbgr"/>
    <language>en</language>
    <item>
      <title>Implementing Pollard's Kangaroo Algorithm on CUDA</title>
      <dc:creator>Raphael Bernardo</dc:creator>
      <pubDate>Thu, 12 Mar 2026 17:02:11 +0000</pubDate>
      <link>https://dev.to/raphaelbgr/implementing-pollards-kangaroo-algorithm-on-cuda-c4j</link>
      <guid>https://dev.to/raphaelbgr/implementing-pollards-kangaroo-algorithm-on-cuda-c4j</guid>
      <description>&lt;h1&gt;
  
  
  Implementing Pollard's Kangaroo Algorithm on CUDA
&lt;/h1&gt;

&lt;p&gt;Pollard's Kangaroo is an algorithm for solving the elliptic curve discrete logarithm problem (ECDLP) when the private key is known to lie in a specific range. It runs in O(sqrt(n)) time where n is the range size -- meaning a 2^64 range needs about 2^32 steps, not 2^64.&lt;/p&gt;

&lt;p&gt;I implemented this on CUDA and got it running at about 1 billion steps per second on an RTX 3080 Ti. Along the way, I hit a subtle mathematical bug that took 10 debugging sessions to find, and it taught me more about the algorithm than any paper or textbook.&lt;/p&gt;

&lt;h2&gt;
  
  
  How the Algorithm Works
&lt;/h2&gt;

&lt;p&gt;Imagine a number line from 0 to N (the search range). You're looking for a specific point T on this line -- the target private key.&lt;/p&gt;

&lt;p&gt;You release two types of kangaroos:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Tame kangaroos&lt;/strong&gt; start at a known position (the midpoint of the range). They hop forward by pseudorandom amounts -- each hop determined by the current position -- recording the total distance traveled. After enough hops, they've laid down a trail of "footprints" at known positions.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Wild kangaroos&lt;/strong&gt; start at an unknown position (derived from the target public key). They use the same hopping rule. Since the hops are deterministic based on the current position, if a wild kangaroo ever lands on the same point as a tame kangaroo, they'll follow identical paths from that point forward.&lt;/p&gt;

&lt;p&gt;When this collision happens, you can compute:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;private_key = tame_distance - wild_distance
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The expected number of steps before a collision is O(sqrt(n)).&lt;/p&gt;
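&lt;p&gt;The whole collision argument can be sketched in a few lines of Python over a toy group: ordinary integer addition stands in for the curve, so G = 1 and "distance times G" is just the distance. The hash-based hop rule and all constants here are illustrative, not the CUDA code:&lt;/p&gt;

```python
import hashlib

TARGET_KEY = 777_777            # the "private key" we pretend not to know
RANGE_LO, RANGE_HI = 0, 1_000_000

def hop(pos):
    # pseudorandom jump determined only by the current position
    h = hashlib.sha256(str(pos).encode()).digest()
    return 2 ** (h[0] % 12)     # toy jump table of power-of-two distances

def walk(start, steps):
    # forward walk, recording a footprint (position, distance) at every step
    pos, dist, trail = start, 0, {}
    for _ in range(steps):
        trail[pos] = dist
        d = hop(pos)
        pos, dist = pos + d, dist + d
    return trail

tame_start = (RANGE_LO + RANGE_HI) // 2     # tame: known midpoint start
tame_trail = walk(tame_start, 20_000)
wild_trail = walk(TARGET_KEY, 20_000)       # wild: starts at the hidden key

recovered = None
for pos, wild_dist in wild_trail.items():
    if pos in tame_trail:                   # shared footprint = collision
        recovered = tame_start + tame_trail[pos] - wild_dist
        break
```

&lt;p&gt;Here every hop is plain addition; on secp256k1 each hop is a full point addition, which is where all the compute goes.&lt;/p&gt;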

&lt;h2&gt;
  
  
  Why GPU?
&lt;/h2&gt;

&lt;p&gt;Each kangaroo walk is independent -- different starting points, independent state. This maps perfectly to GPU threads: one kangaroo per thread, thousands running in parallel.&lt;/p&gt;

&lt;p&gt;But each step is expensive: a full secp256k1 point addition (12 field multiplications × 64 multiply-accumulate operations each = ~768 32-bit operations per step). The GPU earns its keep through raw parallelism.&lt;/p&gt;

&lt;h2&gt;
  
  
  The Implementation
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Jump Table
&lt;/h3&gt;

&lt;p&gt;Each kangaroo's jump distance is determined by a hash of its current position. I use a table of 32 precomputed jump distances, selecting based on the low bits of the current x-coordinate:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight c"&gt;&lt;code&gt;&lt;span class="n"&gt;jump_index&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;point&lt;/span&gt;&lt;span class="p"&gt;.&lt;/span&gt;&lt;span class="n"&gt;x&lt;/span&gt; &lt;span class="o"&gt;&amp;amp;&lt;/span&gt; &lt;span class="mh"&gt;0x1F&lt;/span&gt;  &lt;span class="c1"&gt;// low 5 bits&lt;/span&gt;
&lt;span class="n"&gt;jump_distance&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;jump_table&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;jump_index&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;new_point&lt;/span&gt; &lt;span class="o"&gt;=&lt;/span&gt; &lt;span class="n"&gt;current_point&lt;/span&gt; &lt;span class="o"&gt;+&lt;/span&gt; &lt;span class="n"&gt;jump_table_points&lt;/span&gt;&lt;span class="p"&gt;[&lt;/span&gt;&lt;span class="n"&gt;jump_index&lt;/span&gt;&lt;span class="p"&gt;]&lt;/span&gt;
&lt;span class="n"&gt;total_distance&lt;/span&gt; &lt;span class="o"&gt;+=&lt;/span&gt; &lt;span class="n"&gt;jump_distance&lt;/span&gt;
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;The jump table is stored in constant memory -- broadcast to all threads for free.&lt;/p&gt;
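&lt;p&gt;A runnable toy version of the jump rule, with plain integers standing in for curve points (the real kernel adds a precomputed curve point, not the scalar):&lt;/p&gt;

```python
# Toy model of the jump rule: 32 precomputed distances, indexed by the
# low 5 bits of the current x-coordinate (here just an integer).
JUMP_TABLE = [2 ** i for i in range(32)]   # powers of two, as is conventional

def step(x, total_distance):
    jump_index = x % 32                    # low 5 bits of x
    jump = JUMP_TABLE[jump_index]
    # the real kernel adds jump_table_points[jump_index] to the curve point
    return x + jump, total_distance + jump

x, dist = 123_456, 0
for _ in range(1000):
    x, dist = step(x, dist)

assert dist == x - 123_456   # the accumulator tracks exactly how far we moved
```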

&lt;h3&gt;
  
  
  Kernel Structure
&lt;/h3&gt;

&lt;p&gt;Two separate kernels run alternately:&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;code&gt;ComputeTameKangaroos&lt;/code&gt;&lt;/strong&gt;: Each thread manages one tame kangaroo. Starting point is known (midpoint + offset). Hops forward, accumulating distance.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;&lt;code&gt;ComputeWildKangaroos&lt;/code&gt;&lt;/strong&gt;: Each thread manages one wild kangaroo. Starting point is the target public key + random offset. Same hopping rule.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;

&lt;p&gt;Each kernel runs for a batch of steps (configurable), then control returns to the CPU for collision checking.&lt;/p&gt;

&lt;h3&gt;
  
  
  Distinguished Points
&lt;/h3&gt;

&lt;p&gt;Checking every single point for collisions would require transferring enormous amounts of data from GPU to CPU. Instead, I use the &lt;strong&gt;distinguished point method&lt;/strong&gt;: only points where &lt;code&gt;hash(x) mod 2^d == 0&lt;/code&gt; are reported.&lt;/p&gt;

&lt;p&gt;This reduces the data transfer by a factor of 2^d, at the cost of slightly overshooting the actual collision point by an average of 2^d steps. For d=20, about 1 in a million points is distinguished.&lt;/p&gt;
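&lt;p&gt;A minimal Python model of the distinguished-point test (the hash function and &lt;code&gt;d&lt;/code&gt; values here are illustrative):&lt;/p&gt;

```python
import hashlib

def is_distinguished(x, d):
    # report only points whose hashed x-coordinate is 0 mod 2^d
    h = int.from_bytes(hashlib.sha256(str(x).encode()).digest(), "big")
    return h % 2 ** d == 0

# with d = 4 roughly 1 point in 16 qualifies; with d = 20, about 1 in a million
hits = sum(1 for x in range(100_000) if is_distinguished(x, 4))
```

&lt;p&gt;Raising d cuts GPU-to-CPU traffic further but delays detection: the expected overshoot past the true collision grows with 2^d.&lt;/p&gt;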

&lt;h3&gt;
  
  
  256-bit Arithmetic
&lt;/h3&gt;

&lt;p&gt;Each thread needs:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;3 × 256-bit coordinates for the current point (Jacobian: X, Y, Z)&lt;/li&gt;
&lt;li&gt;1 × 256-bit accumulator for total distance&lt;/li&gt;
&lt;li&gt;Temporary variables for field arithmetic&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;That's ~128 bytes of register space per thread before temporaries. With 255 registers available per thread and each register holding 4 bytes, we have room -- but it's tight. I use GRP_SIZE=4 (process 4 steps per kernel invocation per thread) to amortize kernel launch overhead without increasing register pressure.&lt;/p&gt;
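&lt;p&gt;The limb representation is easy to model: each 256-bit value is eight 32-bit limbs, and addition propagates a carry through them -- what the CUDA kernel does with add-with-carry instructions. A Python sketch of that layout (not the kernel code):&lt;/p&gt;

```python
BASE = 2 ** 32   # one limb per 32-bit register

def to_limbs(n):
    # split a 256-bit integer into eight 32-bit limbs, least-significant first
    limbs = []
    for _ in range(8):
        n, r = divmod(n, BASE)
        limbs.append(r)
    return limbs

def add_limbs(a, b):
    # schoolbook add with carry propagation, limb by limb
    out, carry = [], 0
    for x, y in zip(a, b):
        carry, s = divmod(x + y + carry, BASE)
        out.append(s)
    return out   # carry out of the top limb is dropped (wraparound mod 2^256)

a = 2 ** 255 + 12_345
b = 2 ** 200 + 67_890
assert add_limbs(to_limbs(a), to_limbs(b)) == to_limbs((a + b) % 2 ** 256)
```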

&lt;h2&gt;
  
  
  The Negation Map Bug
&lt;/h2&gt;

&lt;p&gt;Here's where the story gets interesting. Standard implementations of Pollard's Kangaroo use a "negation map" optimization: if the y-coordinate of the current point indicates it's in the "negative half" of the curve, negate both the point and the distance. This effectively doubles the chance of collisions because equivalent points are mapped to a canonical form.&lt;/p&gt;

&lt;p&gt;I implemented this. It seemed to work on small test cases. Then on larger ranges, the algorithm would run for 10x the expected time with zero valid collisions.&lt;/p&gt;

&lt;h3&gt;
  
  
  10 Sessions of Debugging
&lt;/h3&gt;

&lt;p&gt;Sessions 1-3: "Maybe the jump table is wrong." Verified every entry. Table was correct.&lt;/p&gt;

&lt;p&gt;Sessions 4-5: "Maybe the GPU arithmetic has a bug." Compared GPU output against a Python reference implementation for every step. Arithmetic was correct.&lt;/p&gt;

&lt;p&gt;Sessions 6-7: "Maybe the collision detection is broken." Added extensive logging. Collisions were being detected, but the recovered keys were wrong.&lt;/p&gt;

&lt;p&gt;Sessions 8-9: "Maybe the distance accumulation has an overflow." Added overflow checks everywhere. No overflows.&lt;/p&gt;

&lt;p&gt;Session 10: Found it.&lt;/p&gt;

&lt;h3&gt;
  
  
  The Root Cause
&lt;/h3&gt;

&lt;p&gt;The negation map preserves an invariant for tame kangaroos:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;distance * G = point    // tame invariant
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After negation: &lt;code&gt;(N - distance) * G = -point&lt;/code&gt;. The invariant still holds because we negated both sides.&lt;/p&gt;

&lt;p&gt;For wild kangaroos, the invariant is different:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;target + distance * G = point    // wild invariant
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;After negating both the point and the distance, the invariant would require &lt;code&gt;target + (N - distance) * G = -point&lt;/code&gt;. But &lt;code&gt;target + (N - distance) * G = target - distance * G = 2*target - point ≠ -point&lt;/code&gt;. &lt;strong&gt;The invariant breaks.&lt;/strong&gt;&lt;/p&gt;

&lt;p&gt;The negation map assumes the relationship between distance and point is &lt;code&gt;distance * G = point&lt;/code&gt;. For wild kangaroos, there's an additive constant (the target public key) that doesn't participate in the negation. After negating, the distance no longer correctly encodes the relationship to the target.&lt;/p&gt;

&lt;p&gt;This means when you detect a collision and compute &lt;code&gt;tame_dist - wild_dist&lt;/code&gt;, you get garbage.&lt;/p&gt;
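&lt;p&gt;The asymmetry is easy to demonstrate in a toy additive group where G = 1, so "d * G" is just d mod N (constants here are illustrative):&lt;/p&gt;

```python
N = 2 ** 61 - 1      # toy group order; with G = 1, "d * G" is just d mod N
K = 123_456_789      # the hidden target key (the wild starting offset)
dist = 42_000        # some accumulated distance

# tame invariant: dist * G == point. Negate both sides; it survives.
tame_point = dist % N
assert (N - dist) % N == (N - tame_point) % N

# wild invariant: target + dist * G == point. Apply the same negation rule:
# replace point with -point and dist with N - dist. It does NOT survive,
# because the additive target term never gets negated.
wild_point = (K + dist) % N
neg_wild_point = (N - wild_point) % N
assert (K + (N - dist)) % N != neg_wild_point   # gives 2*K - point, not -point
```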

&lt;h3&gt;
  
  
  The Fix
&lt;/h3&gt;

&lt;p&gt;Remove the negation map from both tame and wild kangaroo kernels. The theoretical 2x speedup from the negation map is worth nothing if the algorithm produces wrong answers.&lt;/p&gt;

&lt;p&gt;I have not found this specific bug documented anywhere. JLP's original Kangaroo implementation does not use a negation map in the walk step -- perhaps because its authors ran into the same issue and quietly avoided it.&lt;/p&gt;

&lt;h3&gt;
  
  
  Verification
&lt;/h3&gt;

&lt;p&gt;After removing the negation map:&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Test&lt;/th&gt;
&lt;th&gt;Range&lt;/th&gt;
&lt;th&gt;Result&lt;/th&gt;
&lt;th&gt;Time&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;24-bit&lt;/td&gt;
&lt;td&gt;2^24&lt;/td&gt;
&lt;td&gt;PASS&lt;/td&gt;
&lt;td&gt;7.0s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;28-bit&lt;/td&gt;
&lt;td&gt;2^28&lt;/td&gt;
&lt;td&gt;PASS&lt;/td&gt;
&lt;td&gt;2.0s&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;40-bit&lt;/td&gt;
&lt;td&gt;2^40&lt;/td&gt;
&lt;td&gt;PASS&lt;/td&gt;
&lt;td&gt;47.3s&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;All known puzzle keys recovered correctly.&lt;/p&gt;

&lt;h2&gt;
  
  
  Production Concerns
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Checkpoint/Resume
&lt;/h3&gt;

&lt;p&gt;For large ranges (2^135), the algorithm runs for days or weeks. Binary checkpoint files save the state of all kangaroos (positions + distances) every 10 minutes. On restart, load the checkpoint and continue from where you left off.&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;Important lesson learned&lt;/strong&gt;: after fixing the negation map bug, all existing checkpoints were invalid (they contained corrupted distances). Always version your checkpoint format and validate on load.&lt;/p&gt;
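&lt;p&gt;A sketch of what a versioned, validated checkpoint format can look like -- the magic bytes, layout, and field widths here are hypothetical, not my actual file format:&lt;/p&gt;

```python
import io
import struct

MAGIC = b"KANG"
VERSION = 2   # bumped after the negation-map fix made old distances invalid

def save_checkpoint(buf, kangaroos):
    # kangaroos: list of (position, distance) pairs, each a 256-bit integer
    buf.write(MAGIC)
    buf.write(struct.pack("II", VERSION, len(kangaroos)))
    for pos, dist in kangaroos:
        buf.write(pos.to_bytes(32, "big"))
        buf.write(dist.to_bytes(32, "big"))

def load_checkpoint(buf):
    # validate on load: wrong magic or a stale version is a hard error
    if buf.read(4) != MAGIC:
        raise ValueError("not a checkpoint file")
    version, count = struct.unpack("II", buf.read(8))
    if version != VERSION:
        raise ValueError(f"stale checkpoint version {version}")
    out = []
    for _ in range(count):
        pos = int.from_bytes(buf.read(32), "big")
        dist = int.from_bytes(buf.read(32), "big")
        out.append((pos, dist))
    return out

buf = io.BytesIO()
save_checkpoint(buf, [(2 ** 200, 12_345), (7, 2 ** 130)])
buf.seek(0)
restored = load_checkpoint(buf)
```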

&lt;h3&gt;
  
  
  Progress Estimation
&lt;/h3&gt;

&lt;p&gt;The expected number of steps is &lt;code&gt;2.08 * sqrt(range_size)&lt;/code&gt;. I display progress as a multiple of this expected value:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;Wild phase: 1,234,567,890 steps (0.6x expected) | 228 collisions
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;



&lt;p&gt;If you reach 4x expected without a collision, something is probably wrong.&lt;/p&gt;
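&lt;p&gt;The progress multiple in the status line is just this calculation (Python sketch):&lt;/p&gt;

```python
import math

def progress_multiple(steps_done, range_size):
    # expected total steps for the kangaroo method: about 2.08 * sqrt(N)
    expected = 2.08 * math.sqrt(range_size)
    return steps_done / expected

# e.g. 1,234,567,890 steps into a 2^61 search range
ratio = progress_multiple(1_234_567_890, 2 ** 61)
```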

&lt;h3&gt;
  
  
  Multi-GPU
&lt;/h3&gt;

&lt;p&gt;The tame and wild phases can run on different GPUs. Tame kangaroos build the "trail" on one GPU, wild kangaroos search for collisions on another. The distinguished points are collected on CPU and checked against a shared hash table.&lt;/p&gt;
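&lt;p&gt;A sketch of that shared CPU-side table -- a plain dict here, whereas the real structure also has to handle concurrent inserts from multiple GPU feeder threads:&lt;/p&gt;

```python
# Distinguished points keyed by x-coordinate; a tame/wild pair at the same
# x-coordinate is a collision, and the pair of distances yields the key.
dp_table = {}

def report_dp(x, kind, distance):
    # kind is "tame" or "wild"; returns (tame_dist, wild_dist) on a collision
    if x in dp_table:
        other_kind, other_dist = dp_table[x]
        if other_kind != kind:
            if kind == "wild":
                return other_dist, distance
            return distance, other_dist
    dp_table[x] = (kind, distance)
    return None

assert report_dp(0xABC, "tame", 123) is None   # first sighting: stored
hit = report_dp(0xABC, "wild", 45)             # tame/wild pair at the same x
```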

&lt;h2&gt;
  
  
  Performance
&lt;/h2&gt;

&lt;p&gt;On RTX 3080 Ti:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;strong&gt;~1 billion steps/second&lt;/strong&gt;&lt;/li&gt;
&lt;li&gt;&lt;strong&gt;NB_RUN=128&lt;/strong&gt; kangaroos per GPU thread (higher values like 512 show zero collisions -- still investigating)&lt;/li&gt;
&lt;li&gt;Verified correct on 24-bit through 40-bit ranges&lt;/li&gt;
&lt;/ul&gt;

&lt;h2&gt;
  
  
  Takeaways
&lt;/h2&gt;

&lt;ol&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Mathematical invariants matter.&lt;/strong&gt; The negation map bug is not a coding error -- it's a mathematical invariant violation that only affects one of two kangaroo types. No amount of code review would catch it without understanding the underlying algebra.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;GPU algorithm implementation is different from CPU.&lt;/strong&gt; You need to understand both the mathematics and the hardware constraints simultaneously. Register pressure, memory layout, and kernel launch overhead all influence algorithmic choices.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;Production GPU code needs production infrastructure.&lt;/strong&gt; Checkpointing, progress monitoring, multi-GPU coordination, and error logging are not optional for workloads that run for days.&lt;/p&gt;&lt;/li&gt;
&lt;li&gt;&lt;p&gt;&lt;strong&gt;When you're stuck, question your assumptions.&lt;/strong&gt; The negation map is "obviously correct" according to every reference. It is correct -- for the standard formulation. My formulation had a different invariant, and the optimization didn't transfer.&lt;/p&gt;&lt;/li&gt;
&lt;/ol&gt;




&lt;p&gt;&lt;em&gt;I implement parallel algorithms on GPU for cryptographic and HPC workloads. If you need custom CUDA development, reach out on &lt;a href="https://github.com/raphaelbgr" rel="noopener noreferrer"&gt;GitHub&lt;/a&gt; or &lt;a href="https://linkedin.com/in/raphaelbgr" rel="noopener noreferrer"&gt;LinkedIn&lt;/a&gt;.&lt;/em&gt;&lt;/p&gt;

</description>
      <category>cuda</category>
      <category>gpu</category>
      <category>cryptography</category>
      <category>algorithms</category>
    </item>
  </channel>
</rss>
