I wasn’t chasing performance at first.
I was deep inside some heavy computation — image processing, remote sensing, NumPy-heavy workflows — and things were taking too long.
While everyone’s sleeping, I’m out here crunching heat maps and chasing anomalies at 3 AM on Christmas. Santa didn’t bring gifts this year; he brought publication-worthy data. 🎅🔥
That’s when I stumbled upon Numba.
What began as a normal experimentation loop slowly turned into a waiting game. Iterations stretched, and the slow feedback dulled curiosity. Numba didn’t enter my workflow as a “speed hack”; it entered as a way to bring thinking and computation back into sync.
And that changed how I work with performance entirely.
🧠 Why Numba Feels Different To Use
NumPy is already powerful, but some workloads naturally gravitate toward loops:
- pixel / cell-level transformations
- iterative grid passes
- rolling & stencil-style operations
- custom kernels that don’t exist in libraries
These are mathematically honest — but painfully slow in Python.
Numba compiles those functions to optimized machine code through LLVM (via @njit), which means:
- Python syntax stays
- compiled execution takes over
- the bottleneck disappears
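A tiny example of what that looks like in practice (this squared-sum loop is just an illustration, not the workload from my benchmarks):

```python
import numpy as np
from numba import njit

@njit  # compiled to machine code via LLVM on the first call
def sum_of_squares(arr):
    total = 0.0
    for i in range(arr.shape[0]):   # plain Python syntax...
        total += arr[i] * arr[i]    # ...but it runs as native code
    return total

x = np.random.rand(1_000_000)
print(sum_of_squares(x))
```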
To make it happy, I had to:
- keep data shapes predictable
- avoid Python objects in hot paths
- think about memory as something physical
That discipline didn’t just make things faster.
It made the code clearer.
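Concretely, the discipline looks something like this sketch: fixed dtypes, and only arrays and scalars in the hot path (the rescale kernel here is hypothetical):

```python
import numpy as np
from numba import njit

# Hot path rule of thumb: NumPy arrays and scalars only,
# no Python lists, dicts, or objects inside the loop.
@njit
def rescale(values, lo, hi):
    out = np.empty_like(values)
    span = hi - lo
    for i in range(values.shape[0]):
        out[i] = (values[i] - lo) / span
    return out

data = np.random.rand(10_000).astype(np.float64)
print(rescale(data, data.min(), data.max())[:5])
```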
⚡ What The Numbers Look Like (From Numba Benchmarks)
From Numba’s documentation and example workloads, parallel compilation can deliver dramatic CPU-scale gains:
| Variant | Time | Notes |
|---|---|---|
| NumPy implementation | ~5.8s | Interpreter overhead + limited parallelism |
| @njit single-threaded | ~700ms | Big win already |
| @njit(parallel=True) | ~112ms | Multithreaded + vectorized |
That’s roughly 50× faster than the NumPy implementation, and about 6× faster than the non-parallel JIT on CPU-bound loops.
But I wanted to see what this looked like in my own environment.
So I benchmarked it.
🧪 My Local Benchmark (20,000,000-element loop)
Same logic. Same data. Three execution models:
| Variant | Median Runtime | Min Runtime | Speedup vs Python |
|---|---|---|---|
| Python + NumPy loop (GIL-bound) | 2.5418 s | 2.5327 s | 1× |
| Numba (@njit, single-threaded) | 0.0150 s | 0.0147 s | ~170× |
| Numba Parallel (@njit(parallel=True)) | 0.0057 s | 0.0054 s | ~445× |
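The exact kernel isn’t the point, but a harness along these lines (with an illustrative squared-sum loop rather than the exact code I used) captures the same three-way comparison:

```python
import time
import numpy as np
from numba import njit, prange

# Illustrative harness: the kernel is a stand-in, not the original benchmark.
N = 20_000_000
x = np.random.rand(N)

def python_loop(arr):              # baseline: interpreted, element-by-element
    total = 0.0
    for i in range(arr.shape[0]):
        total += arr[i] * arr[i]
    return total

@njit
def jit_loop(arr):                 # single-threaded machine code
    total = 0.0
    for i in range(arr.shape[0]):
        total += arr[i] * arr[i]
    return total

@njit(parallel=True)
def parallel_loop(arr):            # multithreaded scalar reduction
    total = 0.0
    for i in prange(arr.shape[0]):
        total += arr[i] * arr[i]
    return total

def bench(fn, reps=5):
    fn(x)                          # warm-up call triggers compilation
    times = []
    for _ in range(reps):
        t0 = time.perf_counter()
        fn(x)
        times.append(time.perf_counter() - t0)
    return min(times), sorted(times)[reps // 2]

for name, fn in [("python_loop", python_loop),
                 ("jit_loop", jit_loop),
                 ("parallel_loop", parallel_loop)]:
    best, median = bench(fn)
    print(f"{name:14s} min={best:.4f}s median={median:.4f}s")
```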
I stared at that table for a second and just laughed — the difference is wild.
The pattern was impossible to ignore:
- Python loop = fine for logic, terrible for math
- Numba JIT = removes interpreter overhead
- Parallel Numba = unleashes full CPU cores
And the biggest effect wasn’t just speed.
It was shortened feedback cycles.
🧵 Why Numba Beats Normal Python For CPU Workloads
Pure Python is limited by the GIL.
Even if you create threads, only one runs Python bytecode at a time. Multiprocessing helps, but adds IPC + serialization overhead.
Inside a compiled Numba function:
- the GIL can be released (via nogil=True, or inside parallel regions; see the sketch after this list)
- operations run as native machine code
- loops scale across CPU cores (when safe to parallelize)
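One nuance: a plain @njit function still holds the GIL when you call it; the nogil=True option (or a parallel region) is what actually lets threads overlap. A minimal sketch with ordinary Python threads (the heavy_sum kernel is illustrative):

```python
from concurrent.futures import ThreadPoolExecutor
import numpy as np
from numba import njit

@njit(nogil=True)          # compiled code drops the GIL while it runs
def heavy_sum(arr):
    total = 0.0
    for i in range(arr.shape[0]):
        total += arr[i] ** 0.5
    return total

chunks = [np.random.rand(5_000_000) for _ in range(4)]
heavy_sum(chunks[0])       # warm-up compile before threading

# With the GIL released, these four calls can overlap on four cores.
with ThreadPoolExecutor(max_workers=4) as pool:
    results = list(pool.map(heavy_sum, chunks))
print(sum(results))
```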
Conceptually:
| Approach | Threads | Behavior |
|---|---|---|
| Pure Python loop | 🚫 GIL-bound | Slow |
| NumPy ufuncs | ✅ Vectorized C (BLAS-backed ops multithreaded) | Fast enough |
| @njit | ❗ Single-thread machine code | Much faster |
| @njit(parallel=True) | ✅ Multithreaded + SIMD | Fastest |
When your workload lives inside numeric loops, parallel=True feels like adding oxygen.
🧩 “Interactive” Comparison Block
🔍 Before: Pure Python Loop
Slow. Interpreter overhead. GIL-bound.
Best used for logic, not computation.
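Something like this hypothetical pixel-threshold loop:

```python
import numpy as np

def threshold_python(img, cutoff):
    # Nested Python loops: every pixel pays interpreter overhead.
    out = np.empty_like(img)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            out[i, j] = 1.0 if img[i, j] > cutoff else 0.0
    return out

img = np.random.rand(2_000, 2_000)
mask = threshold_python(img, 0.5)
```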
⚙️ After: Numba JIT-Compiled Loop
- compiled via LLVM
- CPU-native execution
- predictable performance
Feels like Python, behaves like C.
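Same hypothetical loop, one decorator added:

```python
import numpy as np
from numba import njit

@njit  # same body, now compiled via LLVM
def threshold_jit(img, cutoff):
    out = np.empty_like(img)
    for i in range(img.shape[0]):
        for j in range(img.shape[1]):
            out[i, j] = 1.0 if img[i, j] > cutoff else 0.0
    return out

img = np.random.rand(2_000, 2_000)
mask = threshold_jit(img, 0.5)  # first call compiles, later calls are fast
```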
🚀 Parallel Numba (prange + parallel=True)
- spreads work across CPU cores
- releases the GIL inside hot loops
- ideal for pixel / grid workloads
Where Numba truly shines on CPUs.
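And the parallel version of the same sketch, with prange on the outer loop:

```python
import numpy as np
from numba import njit, prange

@njit(parallel=True)
def threshold_parallel(img, cutoff):
    out = np.empty_like(img)
    for i in prange(img.shape[0]):      # rows are distributed across cores
        for j in range(img.shape[1]):   # each row writes its own slice, so
            out[i, j] = 1.0 if img[i, j] > cutoff else 0.0  # it's safe to parallelize
    return out

img = np.random.rand(2_000, 2_000)
mask = threshold_parallel(img, 0.5)
```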
🎁 Underrated Numba Features I Learned To Appreciate
- cache=True: reuse compiled code across runs.
- nopython=True: forces discipline and reveals hidden Python objects.
- parallel=True + prange: turns heavy loops into multithreaded kernels.
- fastmath=True: lets the compiler vectorize aggressively (when numerics allow).
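These flags compose. A hypothetical rolling-mean kernel using them together might look like this:

```python
import numpy as np
from numba import njit, prange

# fastmath=True relaxes strict IEEE ordering so the compiler can vectorize harder;
# cache=True writes the compiled result to disk so later runs skip compilation
# (recent Numba versions also allow caching parallel kernels).
@njit(parallel=True, fastmath=True, cache=True)
def rolling_mean(x, window):
    out = np.empty(x.shape[0] - window + 1)
    for i in prange(out.shape[0]):
        s = 0.0
        for k in range(window):
            s += x[i + k]
        out[i] = s / window
    return out

signal = np.random.rand(1_000_000)
print(rolling_mean(signal, 16)[:3])
```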
But the biggest gift wasn’t raw performance.
It was momentum.
Research cycles shifted from:
write → run → wait → context-switch
into:
write → run → iterate
And curiosity stayed in motion.
⚖️ Real-World Caveats That Matter
Numba isn’t a silver bullet.
- first call includes compile warm-up
- debugging inside JIT code can sting
- sometimes NumPy is already optimal
- chaotic control-flow doesn’t JIT well
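The warm-up caveat is easy to see for yourself: time the first call against the second on any small kernel (this one is illustrative) and the compile cost stands out.

```python
import time
import numpy as np
from numba import njit

@njit
def total(arr):
    s = 0.0
    for i in range(arr.shape[0]):
        s += arr[i]
    return s

x = np.random.rand(1_000_000)

t0 = time.perf_counter(); total(x); first = time.perf_counter() - t0
t0 = time.perf_counter(); total(x); second = time.perf_counter() - t0
print(f"first call (includes compilation): {first:.3f}s")
print(f"second call (compiled code only):  {second:.5f}s")
```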
It works best when:
- logic is numeric
- loops are intentional
- computation is meaningful
It isn’t glitter.
It’s a performance contract.
🧭 What Numba Changed In How I Write Code
It nudged me to:
- separate meaningful loops from accidental ones
- design transformations with purpose
- treat performance as part of expression
Somewhere between algorithms and hardware, Numba didn’t just make my code faster.
It made exploration lighter.
⚡
