A Thank You Note
Before we dive in, I want to acknowledge Shreyan Ghosh (@zenoguy) and his wonderful article "When Time Became a Variable – Notes From My Journey With Numba".
His piece captured something beautiful about computing: the joy of experimentation, the thrill of watching code go fast, and the curiosity to ask "what if?"
This line stuck with me:
"Somewhere between algorithms and hardware, Numba didn't just make my code faster. It made exploration lighter."
Reading his benchmarks, I couldn't help but wonder: What happens when we throw Rust into the mix? What about raw CUDA? Where does the hardware actually give up?
So I built a dojo. Let's spar.
🎯 The Challenge
Same challenge as Shreyan's original experiment:
f(x) = sqrt(x² + 1) × sin(x) + cos(x/2)
Compute this for 20 million elements.
Simple math. Maximum optimization. Who wins?
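For reference, the whole workload fits in one vectorized line. Here's a minimal NumPy version – the 1.0x baseline in the results below (the uniform [0, 1) input range is my assumption; see the repo for the exact setup):

```python
import numpy as np

rng = np.random.default_rng(42)          # same seed as the benchmarks
x = rng.random(20_000_000)               # 20 million float64 elements

def compute_numpy(arr):
    return np.sqrt(arr * arr + 1.0) * np.sin(arr) + np.cos(arr * 0.5)

result = compute_numpy(x)
```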
🔥 The Contenders
I assembled fighters from different worlds:
Team Python 🐍
- Pure Python – The baseline. Interpreter overhead. GIL-bound.
- NumPy Vectorized – The standard approach.
- Numba JIT – Single-threaded compiled.
- Numba Parallel – Multi-threaded with `prange`.
- Numba @vectorize – Parallel ufunc magic.
Team Rust 🦀
- Single-threaded – Idiomatic iterators.
- Parallel (Rayon) – Work-stealing parallelism.
- Parallel Chunks – Cache-optimized chunking.
Team GPU 🎮
- Numba CUDA – Python on the GPU.
- CUDA C++ FP64 – Double precision native.
- CUDA C++ FP32 – Single precision native.
- CUDA C++ Intrinsics – Hardware-optimized math.
🏗️ The Setup
I wanted this to be reproducible and fair:
- Same computation across all implementations
- Same array size (20 million float64 elements)
- Same random seed (42, obviously)
- Multiple warmup runs to eliminate JIT/cache effects
- Take the minimum of multiple runs (least noise)
The full benchmark suite is open source: github.com/copyleftdev/numba-dojo
```bash
# Run everything yourself
git clone https://github.com/copyleftdev/numba-dojo.git
cd numba-dojo
make all
```
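The harness itself boils down to something like this – a minimal sketch of the methodology above, not the repo's exact code:

```python
import time
import numpy as np

def benchmark(fn, arr, warmup=3, runs=10):
    # Warmup runs absorb JIT compilation and cache effects.
    for _ in range(warmup):
        fn(arr)
    times = []
    for _ in range(runs):
        t0 = time.perf_counter()
        fn(arr)
        times.append(time.perf_counter() - t0)
    return min(times)  # minimum of several runs = least-noise estimate

f = lambda a: np.sqrt(a * a + 1.0) * np.sin(a) + np.cos(a * 0.5)
rng = np.random.default_rng(42)
x = rng.random(20_000_000)
print(f"NumPy: {benchmark(f, x) * 1e3:.2f} ms")
```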
🏆 The Results
Let's see who survived the dojo.
The Full Leaderboard
| Rank | Implementation | Time | Speedup vs NumPy |
|---|---|---|---|
| 🥇 | CUDA C++ FP32 | 0.21 ms | 3,255x |
| 🥈 | Numba CUDA FP32 | 2.52 ms | 265x |
| 🥉 | CUDA C++ FP64 | 4.11 ms | 162x |
| 4 | Numba CUDA FP64 | 4.14 ms | 161x |
| 5 | Rust Parallel | 12.39 ms | 54x |
| 6 | Numba @vectorize | 14.86 ms | 45x |
| 7 | Numba Parallel | 15.55 ms | 43x |
| 8 | Rust Single | 555.62 ms | 1.2x |
| 9 | Numba JIT | 558.30 ms | 1.2x |
| 10 | NumPy Vectorized | 667.30 ms | 1.0x |
| 11 | Pure Python | ~6,650 ms | 0.1x |
Speedup Visualization
Category Champions
🔬 What I Learned
1. GPU Demolishes CPU (When It Fits)
The RTX 3080 Ti at 0.21ms is 3,255x faster than NumPy. That's not a typo.
For embarrassingly parallel workloads like element-wise computation, GPUs are in a different league. The massive parallelism (80 streaming multiprocessors, thousands of cores) absolutely crushes sequential execution.
2. FP32 is 20x Faster Than FP64 on Consumer GPUs
```
CUDA FP64: 4.11 ms
CUDA FP32: 0.21 ms   → 20x faster!
```
Consumer GPUs (GeForce series) have very few FP64 units – typically 1/32 to 1/64 the FP32 throughput (1/64 on the Ampere-based RTX 3080 Ti). If your computation can tolerate single precision, use it.
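If you're not sure whether your computation tolerates single precision, measure the damage directly. A quick sketch (error magnitudes depend heavily on your input range):

```python
import numpy as np

def f(a):
    return np.sqrt(a * a + 1.0) * np.sin(a) + np.cos(a * 0.5)

rng = np.random.default_rng(42)
x64 = rng.random(1_000_000)              # float64 reference
x32 = x64.astype(np.float32)             # single-precision copy

err = np.max(np.abs(f(x64) - f(x32).astype(np.float64)))
print(f"max abs error: {err:.2e}")       # on the order of 1e-7 to 1e-6 for inputs in [0, 1)
```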
3. Rust โ Numba JIT (Single-Threaded)
```
Rust Single: 555.62 ms
Numba JIT:   558.30 ms
```
Both compile to LLVM IR. Both get similar codegen. The difference is noise. This validates Numba's claim: "Feels like Python, behaves like C."
4. Rust Beats Numba in Parallel (~20%)
```
Rust Parallel (Rayon): 12.39 ms
Numba Parallel:        15.55 ms
```
Rayon's work-stealing scheduler has lower overhead than Numba's threading. For CPU-parallel workloads in production, Rust has an edge.
5. We Hit the Memory Bandwidth Wall
This was the most interesting discovery.
When I profiled the FP32 CUDA kernel:
```
Time:        0.21 ms
Bandwidth:   ~777 GB/s achieved
Theoretical: 912 GB/s (RTX 3080 Ti)
Efficiency:  85%
```
We're running at 85% of peak memory bandwidth. The GPU cores are actually waiting for data. No algorithm can beat physics.
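The bandwidth figure falls out of simple arithmetic – each element is one 4-byte load plus one 4-byte store:

```python
n = 20_000_000
bytes_moved = n * (4 + 4)                # one float32 read + one write per element = 160 MB
seconds = 0.21e-3                        # measured FP32 kernel time
print(f"{bytes_moved / seconds / 1e9:.0f} GB/s")  # ≈ 762 GB/s – consistent with the ~777 GB/s profiled
```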
This is the Roofline Model in action:
```
Performance
    ▲
    │              ____________ Peak Compute
    │             /
    │            /
    │           /  ← We're here (memory-bound)
    │          /
    │         /
    └──────────────────────────▶ Arithmetic Intensity
         slope = Memory Bandwidth
```
For this workload with low arithmetic intensity (few ops per byte), we've hit the ceiling.
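To put a number on "low arithmetic intensity", here's a back-of-envelope count. The flop count is a rough assumption (sin and cos expand to many more hardware ops), and ~34 TFLOP/s is the ballpark FP32 peak of an RTX 3080 Ti:

```python
flops_per_elem = 7                  # mul, add, sqrt, sin, mul, cos, add (rough)
bytes_per_elem = 4 + 4              # float32 load + store
ai = flops_per_elem / bytes_per_elem     # ≈ 0.9 FLOP/byte
ridge = 34e12 / 912e9                    # ridge point ≈ 37 FLOP/byte
print(ai < ridge)                        # True – far left of the ridge, firmly memory-bound
```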
🧪 The Code
Here's what each implementation looks like:
Numba (The Hero of the Original Article)
```python
from numba import njit, prange
import numpy as np

@njit(parallel=True, fastmath=True, cache=True)
def compute_numba_parallel(arr, out):
    n = len(arr)
    for i in prange(n):
        val = arr[i]
        out[i] = np.sqrt(val * val + 1.0) * np.sin(val) + np.cos(val * 0.5)
```
Just add @njit. That's it. Shreyan was right – this is magical.
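The @vectorize entry from the leaderboard is just as compact. A sketch of how it looks (check the repo for the exact signature and flags):

```python
import numpy as np
from numba import vectorize

@vectorize(["float64(float64)"], target="parallel")
def compute_vectorized(val):
    return np.sqrt(val * val + 1.0) * np.sin(val) + np.cos(val * 0.5)

x = np.random.rand(20_000_000)
out = compute_vectorized(x)   # behaves like a NumPy ufunc, spread across all cores
```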
Rust (The Challenger)
```rust
use rayon::prelude::*;

fn compute_parallel(arr: &[f64], out: &mut [f64]) {
    out.par_iter_mut()
        .zip(arr.par_iter())
        .for_each(|(o, &v)| {
            *o = (v * v + 1.0).sqrt() * v.sin() + (v * 0.5).cos();
        });
}
```
Rayon makes parallelism feel as natural as iterators.
CUDA C++ (The Champion)
```cpp
__global__ void compute_fp32(const float* arr, float* out, size_t n) {
    for (size_t idx = blockIdx.x * blockDim.x + threadIdx.x;
         idx < n;
         idx += blockDim.x * gridDim.x) {
        float val = arr[idx];
        out[idx] = sqrtf(val * val + 1.0f) * sinf(val) + cosf(val * 0.5f);
    }
}
```
Grid-stride loops for maximum occupancy.
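For completeness, the Numba CUDA FP64 entry (rank 4) mirrors the same grid-stride pattern in Python. A sketch – the launch configuration here is an arbitrary choice of mine, not the repo's tuned values:

```python
import math
import numpy as np
from numba import cuda

@cuda.jit
def compute_cuda(arr, out):
    start = cuda.grid(1)          # this thread's global index
    stride = cuda.gridsize(1)     # total threads in the grid
    for i in range(start, arr.size, stride):
        val = arr[i]
        out[i] = math.sqrt(val * val + 1.0) * math.sin(val) + math.cos(val * 0.5)

x = np.random.rand(20_000_000)            # float64, like the FP64 benchmark
d_x = cuda.to_device(x)
d_out = cuda.device_array_like(d_x)
compute_cuda[256, 256](d_x, d_out)        # (blocks, threads per block)
cuda.synchronize()
result = d_out.copy_to_host()
```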
🎯 When to Use What
| Scenario | Recommendation |
|---|---|
| Quick prototyping | NumPy (it's fine, really) |
| Need 10-50x speedup, stay in Python | Numba parallel |
| Production CPU workloads | Rust + Rayon |
| Maximum performance, GPU available | CUDA (FP32 if possible) |
| GPU + Python ecosystem | Numba CUDA |
🏁 Final Thoughts
Shreyan's original article reminded me why I love computing: we get to ask "what if?" and then actually find out.
What if we compile this loop? 43x faster.
What if we use all CPU cores? 54x faster.
What if we throw a GPU at it? 3,255x faster.
What if we hit the memory bandwidth wall? Physics wins.
The journey from Pure Python (6.6 seconds) to CUDA FP32 (0.2 milliseconds) is a 33,000x improvement. That's not optimization – that's transformation.
📚 Resources
- Full source code: github.com/copyleftdev/numba-dojo
- Original inspiration: @zenoguy's Numba article
- Numba docs: numba.pydata.org
- Rayon (Rust): docs.rs/rayon
- Roofline Model: Wikipedia
Keep experimenting. Keep playing. That's what computing is for. โจ
What's your favorite performance optimization story? Drop it in the comments!


