When I Took Numba to the Dojo: A Battle Royale Against Rust and CUDA

A Thank You Note

Before we dive in, I want to acknowledge Shreyan Ghosh (@zenoguy) and his wonderful article "When Time Became a Variable – Notes From My Journey With Numba".

His piece captured something beautiful about computing: the joy of experimentation, the thrill of watching code go fast, and the curiosity to ask "what if?"

This line stuck with me:

"Somewhere between algorithms and hardware, Numba didn't just make my code faster. It made exploration lighter."

Reading his benchmarks, I couldn't help but wonder: What happens when we throw Rust into the mix? What about raw CUDA? Where does the hardware actually give up?

So I built a dojo. Let's spar.


🎯 The Challenge

Same challenge as Shreyan's original experiment:

```
f(x) = sqrt(x² + 1) × sin(x) + cos(x/2)
```

Compute this for 20 million elements.

Simple math. Maximum optimization. Who wins?
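
In NumPy terms, the whole benchmark boils down to one line (a sketch; the seed and size match the setup below, though the repo's exact code may differ):

```python
import numpy as np

# 20 million float64 inputs, seed 42
x = np.random.default_rng(42).random(20_000_000)
f = np.sqrt(x**2 + 1.0) * np.sin(x) + np.cos(x / 2.0)
```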


🥊 The Contenders

I assembled fighters from different worlds:

Team Python 🐍

  • Pure Python – The baseline. Interpreter overhead. GIL-bound.
  • NumPy Vectorized – The standard approach.
  • Numba JIT – Single-threaded compiled.
  • Numba Parallel – Multi-threaded with prange.
  • Numba @vectorize – Parallel ufunc magic.

Team Rust 🦀

  • Single-threaded – Idiomatic iterators.
  • Parallel (Rayon) – Work-stealing parallelism.
  • Parallel Chunks – Cache-optimized chunking.

Team GPU 🎮

  • Numba CUDA – Python on the GPU.
  • CUDA C++ FP64 – Double-precision native.
  • CUDA C++ FP32 – Single-precision native.
  • CUDA C++ Intrinsics – Hardware-optimized math.

๐Ÿ—๏ธ The Setup

I wanted this to be reproducible and fair:

  • Same computation across all implementations
  • Same array size (20 million float64 elements)
  • Same random seed (42, obviously)
  • Multiple warmup runs to eliminate JIT/cache effects
  • Take the minimum of multiple runs (least noise) – see the harness sketch below
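
The repo has the full harness; the methodology above boils down to something like this (a minimal sketch – the bench helper and its parameter names are mine, not the repo's):

```python
import time

def bench(fn, *args, warmup=3, runs=10):
    # Warmup runs trigger JIT compilation and warm the caches
    for _ in range(warmup):
        fn(*args)
    # Keep the minimum: the run with the least scheduler/cache noise
    times = []
    for _ in range(runs):
        t0 = time.perf_counter()
        fn(*args)
        times.append(time.perf_counter() - t0)
    return min(times)
```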

The full benchmark suite is open source: github.com/copyleftdev/numba-dojo

```bash
# Run everything yourself
git clone https://github.com/copyleftdev/numba-dojo.git
cd numba-dojo
make all
```

📊 The Results

Let's see who survived the dojo.

The Full Leaderboard

| Rank | Implementation | Time | Speedup vs NumPy |
|------|----------------|------|------------------|
| 🥇 | CUDA C++ FP32 | 0.21 ms | 3,255x |
| 🥈 | Numba CUDA FP32 | 2.52 ms | 265x |
| 🥉 | CUDA C++ FP64 | 4.11 ms | 162x |
| 4 | Numba CUDA FP64 | 4.14 ms | 161x |
| 5 | Rust Parallel | 12.39 ms | 54x |
| 6 | Numba @vectorize | 14.86 ms | 45x |
| 7 | Numba Parallel | 15.55 ms | 43x |
| 8 | Rust Single | 555.62 ms | 1.2x |
| 9 | Numba JIT | 558.30 ms | 1.2x |
| 10 | NumPy Vectorized | 667.30 ms | 1.0x |
| 11 | Pure Python | ~6,650 ms | 0.1x |

Speedup Visualization

*(speedup chart)*

Category Champions

*(category comparison chart)*


🔬 What I Learned

1. GPU Demolishes CPU (When It Fits)

The RTX 3080 Ti at 0.21ms is 3,255x faster than NumPy. That's not a typo.

For embarrassingly parallel workloads like element-wise computation, GPUs are in a different league. The massive parallelism (80 streaming multiprocessors, thousands of cores) absolutely crushes sequential execution.

2. FP32 is 20x Faster Than FP64 on Consumer GPUs

```
CUDA FP64: 4.11 ms
CUDA FP32: 0.21 ms  ← 20x faster!
```

Consumer GPUs (GeForce series) have limited FP64 units – typically 1/32 to 1/64 the throughput of FP32 (1/64 on the Ampere GA102 in the RTX 3080 Ti). If your computation can tolerate single precision, use it.
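
Switching precision is often a one-line change on the Python side. A hedged sketch (the error magnitude depends on your data; this uses the benchmark's [0, 1) inputs):

```python
import numpy as np

x64 = np.random.default_rng(42).random(20_000_000)  # float64
x32 = x64.astype(np.float32)                        # halves memory traffic too

f64 = np.sqrt(x64**2 + 1.0) * np.sin(x64) + np.cos(x64 / 2.0)
f32 = np.sqrt(x32**2 + np.float32(1.0)) * np.sin(x32) + np.cos(x32 / np.float32(2.0))

# Measure what single precision actually costs you before committing
print(np.max(np.abs(f64 - f32)))  # on the order of 1e-7 for inputs in [0, 1)
```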

3. Rust ≈ Numba JIT (Single-Threaded)

```
Rust Single:  555.62 ms
Numba JIT:    558.30 ms
```

Both compile to LLVM IR. Both get similar codegen. The difference is noise. This validates Numba's claim: "Feels like Python, behaves like C."

4. Rust Beats Numba in Parallel (~20%)

```
Rust Parallel (Rayon):  12.39 ms
Numba Parallel:         15.55 ms
```

Rayon's work-stealing scheduler has lower overhead than Numba's threading. For CPU-parallel workloads in production, Rust has an edge.

5. We Hit the Memory Bandwidth Wall

This was the most interesting discovery.

When I profiled the FP32 CUDA kernel:

```
Time:        0.21 ms
Bandwidth:   ~777 GB/s achieved
Theoretical: 912 GB/s (RTX 3080 Ti)
Efficiency:  85%
```

We're running at 85% of peak memory bandwidth. The GPU cores are actually waiting for data. No algorithm can beat physics.

This is the Roofline Model in action:

```
                    Peak Compute
                         /
                        /
Performance            /
                      /  ← We're here (memory-bound)
                     /
                    /
            ────────────────────────
                Memory Bandwidth
```

For this workload with low arithmetic intensity (few ops per byte), we've hit the ceiling.
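
You can sanity-check the bandwidth figure with napkin math (assuming the kernel streams each input once and writes each output once, nothing cached):

```python
# 20M float32 elements: one 4-byte read + one 4-byte write each
n = 20_000_000
bytes_moved = n * 4 * 2            # ~160 MB of traffic per kernel launch
time_s = 0.21e-3                   # measured kernel time
print(f"{bytes_moved / time_s / 1e9:.0f} GB/s")  # ~762 GB/s, in line with the ~777 GB/s profiled
```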


🧪 The Code

Here's what each implementation looks like:

Numba (The Hero of the Original Article)

```python
from numba import njit, prange
import numpy as np

@njit(parallel=True, fastmath=True, cache=True)
def compute_numba_parallel(arr, out):
    n = len(arr)
    for i in prange(n):  # prange splits the loop across all CPU cores
        val = arr[i]
        out[i] = np.sqrt(val * val + 1.0) * np.sin(val) + np.cos(val * 0.5)
```

Just add @njit. That's it. Shreyan was right – this is magical.
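
For reference, the @vectorize contender from the leaderboard looks roughly like this (my reconstruction; the repo's signature and options may differ):

```python
import math
from numba import vectorize

# A parallel ufunc: Numba compiles the scalar body once, then maps it
# over the whole array on all cores, NumPy-ufunc style.
@vectorize(["float64(float64)"], target="parallel")
def compute_vectorized(val):
    return math.sqrt(val * val + 1.0) * math.sin(val) + math.cos(val * 0.5)
```

Call it like any ufunc: `out = compute_vectorized(arr)`.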

Rust (The Challenger)

```rust
use rayon::prelude::*;

fn compute_parallel(arr: &[f64], out: &mut [f64]) {
    out.par_iter_mut()
        .zip(arr.par_iter())
        .for_each(|(o, &v)| {
            *o = (v * v + 1.0).sqrt() * v.sin() + (v * 0.5).cos();
        });
}
```

Rayon makes parallelism feel as natural as iterators.

CUDA C++ (The Champion)

```cpp
__global__ void compute_fp32(const float* arr, float* out, size_t n) {
    for (size_t idx = blockIdx.x * blockDim.x + threadIdx.x;
         idx < n;
         idx += blockDim.x * gridDim.x) {
        float val = arr[idx];
        out[idx] = sqrtf(val * val + 1.0f) * sinf(val) + cosf(val * 0.5f);
    }
}
```

Grid-stride loops for maximum occupancy.
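
And for completeness, the Numba CUDA FP64 contender is the same grid-stride idea written in Python (a sketch; the launch configuration here is illustrative, not necessarily the repo's):

```python
import math
import numpy as np
from numba import cuda

@cuda.jit
def compute_cuda(arr, out):
    # Grid-stride loop, mirroring the C++ kernel above
    start = cuda.grid(1)
    stride = cuda.gridsize(1)
    for i in range(start, arr.shape[0], stride):
        val = arr[i]
        out[i] = math.sqrt(val * val + 1.0) * math.sin(val) + math.cos(val * 0.5)

arr = np.random.default_rng(42).random(20_000_000)  # float64
d_arr = cuda.to_device(arr)
d_out = cuda.device_array_like(d_arr)
threads = 256                                       # a common default
blocks = (arr.size + threads - 1) // threads
compute_cuda[blocks, threads](d_arr, d_out)
out = d_out.copy_to_host()
```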


🎯 When to Use What

| Scenario | Recommendation |
|----------|----------------|
| Quick prototyping | NumPy (it's fine, really) |
| Need 10-50x speedup, stay in Python | Numba parallel |
| Production CPU workloads | Rust + Rayon |
| Maximum performance, GPU available | CUDA (FP32 if possible) |
| GPU + Python ecosystem | Numba CUDA |

๐Ÿ™ Final Thoughts

Shreyan's original article reminded me why I love computing: we get to ask "what if?" and then actually find out.

What if we compile this loop? 43x faster.
What if we use all CPU cores? 54x faster.
What if we throw a GPU at it? 3,255x faster.
What if we hit the memory bandwidth wall? Physics wins.

The journey from Pure Python (6.6 seconds) to CUDA FP32 (0.2 milliseconds) is a 33,000x improvement. That's not optimization – that's transformation.


🔗 Resources

  • Benchmark suite: github.com/copyleftdev/numba-dojo
  • Shreyan Ghosh's original article: "When Time Became a Variable – Notes From My Journey With Numba"
Keep experimenting. Keep playing. That's what computing is for. ✨


What's your favorite performance optimization story? Drop it in the comments!
