You've been using NumPy for a while now. You understand broadcasting. You avoid loops. Your code works.
But then you run it on real data. A million rows. Ten million. Your script that worked fine on 1,000 samples suddenly takes 10 minutes. Or worse, it crashes with a memory error.
You start profiling. Turns out, that one innocent-looking line is eating 90% of your runtime. The operation you thought was O(n) is actually O(n²). Your "vectorized" code is still allocating gigabytes of temporary arrays you didn't know existed.
Here's the truth. Knowing the basics of NumPy gets you 80% of the way there. The last 20% is where the real performance lives. It's in the details nobody talks about. The memory layout tricks. The stride manipulation. The C-order vs Fortran-order gotchas. The obscure functions that solve problems in one line instead of fifty.
Let me show you the advanced stuff.
Memory Layout: Why Your Fast Code Is Actually Slow
NumPy arrays are stored in contiguous memory blocks. But there are two ways to arrange that memory.
Row-Major (C) vs Column-Major (Fortran)
import numpy as np
# C-contiguous (row-major, default)
c_array = np.arange(12).reshape(3, 4)
print(c_array)
# [[ 0 1 2 3]
# [ 4 5 6 7]
# [ 8 9 10 11]]
print(c_array.flags)
# C_CONTIGUOUS : True
# F_CONTIGUOUS : False
# Memory layout in C-order: [0,1,2,3,4,5,6,7,8,9,10,11]
# Rows are stored together
import numpy as np
# Fortran-contiguous (column-major)
f_array = np.asfortranarray(c_array)
print(f_array.flags)
# C_CONTIGUOUS : False
# F_CONTIGUOUS : True
# Memory layout in F-order: [0,4,8,1,5,9,2,6,10,3,7,11]
# Columns are stored together
Why this matters:
import numpy as np
import time
# Create large arrays
size = 10000
c_array = np.arange(size * size).reshape(size, size)
f_array = np.asfortranarray(c_array)
# Row operations on C-array (fast)
start = time.time()
for i in range(size):
    row_sum = c_array[i, :].sum()
c_time = time.time() - start
# Row operations on F-array (slow)
start = time.time()
for i in range(size):
    row_sum = f_array[i, :].sum()
f_time = time.time() - start
print(f"C-array row ops: {c_time:.3f}s")
print(f"F-array row ops: {f_time:.3f}s")
print(f"F-array is {f_time/c_time:.1f}x slower for rows")
# Column operations on F-array (fast)
start = time.time()
for i in range(size):
    col_sum = f_array[:, i].sum()
f_time_col = time.time() - start
# Column operations on C-array (slow)
start = time.time()
for i in range(size):
    col_sum = c_array[:, i].sum()
c_time_col = time.time() - start
print(f"\nF-array column ops: {f_time_col:.3f}s")
print(f"C-array column ops: {c_time_col:.3f}s")
print(f"C-array is {c_time_col/f_time_col:.1f}x slower for columns")
On my machine, F-array row operations are 3x slower. C-array column operations are 3x slower.
The trick:
If you're doing heavy column operations (common in linear algebra, some machine learning), convert to Fortran order first.
import numpy as np
data = np.random.rand(10000, 10000)
# Converting to F-order for column operations
data_f = np.asfortranarray(data)
# Now column operations are 3x faster
column_means = data_f.mean(axis=0)
Strides: The Hidden Performance Lever
Every NumPy array has strides. These tell NumPy how many bytes to skip to move to the next element in each dimension.
import numpy as np
arr = np.arange(12).reshape(3, 4)
print(arr)
# [[ 0 1 2 3]
# [ 4 5 6 7]
# [ 8 9 10 11]]
print(arr.strides)
# (32, 8)
What does (32, 8) mean?
- To move to the next row, skip 32 bytes (4 elements × 8 bytes per int64)
- To move to the next column, skip 8 bytes (1 element)
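Strides also explain why transposing or slicing with a step never copies data: NumPy just records different strides over the same buffer. A quick sketch to confirm this (the variable names are illustrative):

```python
import numpy as np

arr = np.arange(12).reshape(3, 4)

# Transposing swaps the strides -- no data moves
t = arr.T
print(t.strides == arr.strides[::-1])  # True
print(np.shares_memory(t, arr))        # True: same underlying buffer

# Slicing with a step just scales a stride
every_other = arr[:, ::2]
print(every_other.strides[1] == 2 * arr.strides[1])  # True
```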
The trick: You can manipulate strides to create views without copying data.
Creating a Sliding Window View
import numpy as np
from numpy.lib.stride_tricks import as_strided
data = np.arange(10)
print(data)
# [0 1 2 3 4 5 6 7 8 9]
# Create sliding window of size 3
window_size = 3
n_windows = len(data) - window_size + 1
# Manually set strides to create windows
windows = as_strided(
    data,
    shape=(n_windows, window_size),
    strides=(data.itemsize, data.itemsize)
)
print(windows)
# [[0 1 2]
# [1 2 3]
# [2 3 4]
# [3 4 5]
# [4 5 6]
# [5 6 7]
# [6 7 8]
# [7 8 9]]
No data was copied. All windows are views into the original array. This is 100x faster than creating windows with a loop.
Real-world use case: Rolling averages
import numpy as np
from numpy.lib.stride_tricks import as_strided
data = np.random.rand(1000000)
window_size = 100
# Create sliding windows
n_windows = len(data) - window_size + 1
windows = as_strided(
    data,
    shape=(n_windows, window_size),
    strides=(data.itemsize, data.itemsize)
)
# Calculate rolling average (mean of each window)
rolling_avg = windows.mean(axis=1)
print(f"Original size: {len(data)}")
print(f"Rolling averages: {len(rolling_avg)}")
This computes 999,901 averages instantly. No loops. No copies.
Warning: as_strided is dangerous. Invalid strides can read garbage memory or crash Python. Always validate your stride calculations.
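If you're on NumPy 1.20 or newer, `sliding_window_view` does the same stride arithmetic with bounds checking built in, so it's the safer default (this sketch assumes that NumPy version):

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

data = np.arange(10)

# Same windows as the as_strided example, but NumPy validates the
# shape and strides for you and returns a read-only view
windows = sliding_window_view(data, window_shape=3)
print(windows.shape)  # (8, 3)

rolling_avg = windows.mean(axis=1)
```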
Advanced Indexing: Beyond the Basics
Integer Array Indexing (The Confusing One)
import numpy as np
data = np.arange(20).reshape(4, 5)
print(data)
# [[ 0 1 2 3 4]
# [ 5 6 7 8 9]
# [10 11 12 13 14]
# [15 16 17 18 19]]
# Grab specific elements by (row, col) pairs
rows = np.array([0, 1, 2, 3])
cols = np.array([0, 1, 2, 3])
result = data[rows, cols]
print(result)
# [ 0 6 12 18] # Diagonal elements
This grabs data[0,0], data[1,1], data[2,2], data[3,3].
The trick: You can use this for advanced filtering.
import numpy as np
data = np.random.rand(1000, 100)
# Find row and column indices where value > 0.99
rows, cols = np.where(data > 0.99)
# Get those specific values
high_values = data[rows, cols]
print(f"Found {len(high_values)} values > 0.99")
np.ix_: Grid Indexing Made Easy
import numpy as np
data = np.arange(100).reshape(10, 10)
# Want rows [1, 3, 5] and columns [2, 4, 6]
# Without np.ix_, this is painful
# With np.ix_:
rows = np.array([1, 3, 5])
cols = np.array([2, 4, 6])
subgrid = data[np.ix_(rows, cols)]
print(subgrid)
# [[12 14 16]
# [32 34 36]
# [52 54 56]]
np.ix_ creates the right indexing arrays for you. No manual broadcasting needed.
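Under the hood, np.ix_ just reshapes the index arrays so they broadcast against each other. Doing it by hand shows exactly what it saves you:

```python
import numpy as np

data = np.arange(100).reshape(10, 10)
rows = np.array([1, 3, 5])
cols = np.array([2, 4, 6])

# Equivalent to data[np.ix_(rows, cols)]: a (3, 1) column of row
# indices broadcast against a (1, 3) row of column indices
subgrid = data[rows[:, np.newaxis], cols[np.newaxis, :]]
print(subgrid.shape)  # (3, 3)
```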
Vectorization Patterns Nobody Teaches
Pattern 1: Vectorizing Distance Calculations
Problem: Calculate pairwise distances between all points.
Naive approach (loops):
import numpy as np
points = np.random.rand(1000, 2) # 1000 points in 2D
# Slow (nested loops)
distances = np.zeros((1000, 1000))
for i in range(1000):
    for j in range(1000):
        distances[i, j] = np.sqrt(np.sum((points[i] - points[j])**2))
# Takes seconds
Vectorized approach:
import numpy as np
points = np.random.rand(1000, 2)
# Expand dimensions for broadcasting
p1 = points[:, np.newaxis, :] # Shape: (1000, 1, 2)
p2 = points[np.newaxis, :, :] # Shape: (1, 1000, 2)
# Broadcast subtraction
diff = p1 - p2 # Shape: (1000, 1000, 2)
# Calculate distances
distances = np.sqrt((diff**2).sum(axis=2))
# Takes milliseconds
10-100x faster. No loops.
Even faster with einsum:
import numpy as np
points = np.random.rand(1000, 2)
# Using einsum for dot products
sq_distances = (
    np.einsum('ij,ij->i', points, points)[:, np.newaxis] +
    np.einsum('ij,ij->i', points, points)[np.newaxis, :] -
    2 * np.einsum('ik,jk->ij', points, points)
)
# Clamp tiny negatives from floating-point round-off before the sqrt
distances = np.sqrt(np.maximum(sq_distances, 0))
This is the fastest pure NumPy approach.
Pattern 2: Vectorizing Conditional Logic
Problem: Apply different formulas based on conditions.
import numpy as np
x = np.random.rand(1000000) * 10
# Slow (loop with if-else)
result = np.zeros(len(x))
for i in range(len(x)):
    if x[i] < 3:
        result[i] = x[i] ** 2
    elif x[i] < 7:
        result[i] = x[i] * 10
    else:
        result[i] = x[i] / 2
Vectorized with np.select:
import numpy as np
x = np.random.rand(1000000) * 10
# Define conditions
conditions = [
    x < 3,
    (x >= 3) & (x < 7),
    x >= 7
]
# Define corresponding values
choices = [
    x ** 2,
    x * 10,
    x / 2
]
# Apply all at once
result = np.select(conditions, choices)
100x faster. One pass through the data instead of a million if-else checks.
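When there are only two branches, np.where is the simpler sibling of np.select. A minimal sketch:

```python
import numpy as np

x = np.random.rand(1_000_000) * 10

# Single condition: square values below 5, halve the rest
result = np.where(x < 5, x ** 2, x / 2)
```

Note that both branch expressions are evaluated for every element before selection, so avoid expressions that error on the "wrong" branch (e.g. np.sqrt of negatives) without masking first.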
Pattern 3: Vectorizing Cumulative Operations with Different Windows
Problem: Calculate moving average with varying window sizes.
import numpy as np
data = np.random.rand(10000)
window_sizes = np.random.randint(5, 100, size=10000)
# Can't use np.convolve because the window size changes per element
# np.add.reduceat handles segment sums, but overlapping, variable
# windows are awkward to express with it
np.add.reduceat can get you part of the way, but it's built for non-overlapping segments and gets ugly here. Better approach: use Numba for this case or accept a loop.
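For sums and means specifically, there is a fully vectorized workaround: prefix sums turn every window sum into a single subtraction. A sketch of that idea:

```python
import numpy as np

data = np.random.rand(10000)
window_sizes = np.random.randint(5, 100, size=10000)

# Prefix sums: cs[i] holds sum(data[:i]), so any window sum is cs[end] - cs[start]
cs = np.concatenate(([0.0], np.cumsum(data)))

idx = np.arange(len(data))
starts = np.maximum(idx - window_sizes + 1, 0)  # clamp early, short windows

window_sums = cs[idx + 1] - cs[starts]
window_counts = idx + 1 - starts
rolling_means = window_sums / window_counts
```

One caveat: long prefix sums accumulate floating-point error, so for very long series or wildly varying magnitudes the Numba route is still more robust.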
Memory-Efficient Operations
Problem: Operating on Arrays Larger Than RAM
You have a 50GB array. You have 16GB RAM. How do you process it?
Solution 1: Memory-Mapped Arrays
import numpy as np
# Create a memory-mapped array (stored on disk)
mmap_array = np.memmap(
    'large_data.dat',
    dtype='float64',
    mode='w+',
    shape=(100000000, 100)  # 80GB of data
)
# Write data in chunks
chunk_size = 1000000
for i in range(0, len(mmap_array), chunk_size):
    end = min(i + chunk_size, len(mmap_array))
    mmap_array[i:end] = np.random.rand(end - i, 100)
mmap_array.flush()  # Write to disk
# Process in chunks
chunk_sums = []
for i in range(0, len(mmap_array), chunk_size):
    end = min(i + chunk_size, len(mmap_array))
    chunk = mmap_array[i:end]
    chunk_sums.append(chunk.sum())
total_sum = np.sum(chunk_sums)
The array lives on disk. NumPy loads chunks into RAM as needed.
Solution 2: Out-of-Core Processing with Views
import numpy as np
# Create large file
data = np.memmap('data.dat', dtype='float32', mode='w+', shape=(1000000000,))
# Process with views (no extra memory)
for i in range(0, len(data), 10000000):
    chunk = data[i:i+10000000]
    chunk *= 2  # Process in place
# Flush changes and release the memory map
del data
In-Place Operations: Saving Memory
import numpy as np
# Creates a new array (doubles memory usage)
data = np.random.rand(100000000)
result = data * 2
# In-place (no extra memory)
data = np.random.rand(100000000)
data *= 2
All in-place operators:
- +=, -=, *=, /=, //=, **=, %=
- np.add(a, b, out=a), np.multiply(a, b, out=a), etc.
The trick: Use the out parameter in ufuncs.
import numpy as np
a = np.random.rand(10000000)
b = np.random.rand(10000000)
# Creates new array
result = np.add(a, b)
# Reuses existing array
result = np.empty(len(a))
np.add(a, b, out=result)
# Or even better, reuse one of the inputs
np.add(a, b, out=a) # Result stored in a
Einstein Summation (einsum): The Swiss Army Knife
np.einsum is incredibly powerful but confusing. Once you understand it, it replaces dozens of reshapes and transposes.
The Notation
einsum uses index notation. Letters represent dimensions.
import numpy as np
# Matrix multiplication
A = np.random.rand(3, 4)
B = np.random.rand(4, 5)
# Traditional way
result = A @ B
# With einsum
result = np.einsum('ij,jk->ik', A, B)
'ij,jk->ik' means:
- First array has dimensions i,j
- Second array has dimensions j,k
- Sum over j (appears on both sides)
- Result has dimensions i,k
Common Patterns
Trace (diagonal sum):
import numpy as np
matrix = np.random.rand(100, 100)
# Traditional
trace = np.trace(matrix)
# Einsum
trace = np.einsum('ii->', matrix)
Batch matrix multiplication:
import numpy as np
# 10 matrices of size 3x4
A = np.random.rand(10, 3, 4)
# 10 matrices of size 4x5
B = np.random.rand(10, 4, 5)
# Multiply corresponding pairs
result = np.einsum('nij,njk->nik', A, B)
# Result shape: (10, 3, 5)
Element-wise multiplication then sum:
import numpy as np
A = np.random.rand(1000, 1000)
B = np.random.rand(1000, 1000)
# Traditional
result = (A * B).sum()
# Einsum (faster)
result = np.einsum('ij,ij->', A, B)
Outer product of all vectors:
import numpy as np
vectors = np.random.rand(100, 3)
# Outer product of each vector with itself
outer_products = np.einsum('ni,nj->nij', vectors, vectors)
# Shape: (100, 3, 3)
When einsum is Faster
- Operations that combine reshape + transpose + multiply
- Batch operations
- Complex tensor contractions
The trick: Use optimize=True for complex expressions.
import numpy as np
A = np.random.rand(50, 100)
B = np.random.rand(100, 75)
C = np.random.rand(75, 50)
# Without optimization
result = np.einsum('ij,jk,kl->il', A, B, C)
# With optimization (finds best evaluation order)
result = np.einsum('ij,jk,kl->il', A, B, C, optimize=True)
The optimized version can be 10x faster for complex expressions.
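You can inspect (and reuse) the plan that optimize=True finds with np.einsum_path:

```python
import numpy as np

A = np.random.rand(50, 100)
B = np.random.rand(100, 75)
C = np.random.rand(75, 50)

# Returns the chosen contraction order plus a human-readable cost report
path, report = np.einsum_path('ij,jk,kl->il', A, B, C, optimize='greedy')
print(report)

# Pass the precomputed path back in to skip re-planning on repeated calls
result = np.einsum('ij,jk,kl->il', A, B, C, optimize=path)
```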
Structured Arrays: DataFrames Without Pandas
Structured arrays let you store different data types in one array.
import numpy as np
# Define structure
dtype = [
    ('name', 'U20'),   # 20-character Unicode string
    ('age', 'i4'),     # 32-bit integer
    ('salary', 'f8')   # 64-bit float
]
# Create structured array
employees = np.array([
    ('Alice', 30, 75000.0),
    ('Bob', 25, 65000.0),
    ('Charlie', 35, 85000.0)
], dtype=dtype)
print(employees)
# [('Alice', 30, 75000.) ('Bob', 25, 65000.) ('Charlie', 35, 85000.)]
# Access by field name
print(employees['name'])
# ['Alice' 'Bob' 'Charlie']
# Filter by condition
high_earners = employees[employees['salary'] > 70000]
print(high_earners)
# [('Alice', 30, 75000.) ('Charlie', 35, 85000.)]
# Calculate on specific field
avg_salary = employees['salary'].mean()
print(f"Average salary: ${avg_salary:,.2f}")
Why use this instead of Pandas?
- Lower memory overhead
- Faster for simple operations
- Better for interfacing with C libraries
- No external dependency
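The memory claim is easy to verify: a structured array packs each record into a fixed-size block, and itemsize/nbytes report it exactly (reusing the employees array from above):

```python
import numpy as np

dtype = [('name', 'U20'), ('age', 'i4'), ('salary', 'f8')]
employees = np.array([
    ('Alice', 30, 75000.0),
    ('Bob', 25, 65000.0),
    ('Charlie', 35, 85000.0)
], dtype=dtype)

# 'U20' stores 20 UCS-4 code points (80 bytes) + i4 (4) + f8 (8)
print(employees.dtype.itemsize)  # 92 bytes per record
print(employees.nbytes)          # 276 bytes for all 3 records
```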
The trick: Use np.sort with order parameter for multi-field sorting.
import numpy as np
dtype = [('name', 'U20'), ('age', 'i4'), ('salary', 'f8')]
employees = np.array([
    ('Alice', 30, 75000.0),
    ('Bob', 25, 85000.0),
    ('Charlie', 35, 65000.0)
], dtype=dtype)
# Sort ascending by salary, then age
sorted_employees = np.sort(employees, order=['salary', 'age'])[::-1]
# Note: [::-1] reverses everything, so this gives salary descending but
# breaks ties with age descending too. For salary descending with age
# ascending, use np.lexsort instead:
# employees[np.lexsort((employees['age'], -employees['salary']))]
print(sorted_employees)
Masked Arrays: Handling Missing Data Properly
Masked arrays let you mark certain values as invalid without removing them.
import numpy as np
import numpy.ma as ma
# Data with invalid values
data = np.array([1.5, -999, 3.2, -999, 5.1, 2.8])
# Mark -999 as invalid
masked_data = ma.masked_equal(data, -999)
print(masked_data)
# [1.5 -- 3.2 -- 5.1 2.8]
# Operations skip masked values
print(masked_data.mean())
# 3.15 (only valid values)
print(masked_data.std())
# 1.43 (only valid values)
More masking options:
import numpy as np
import numpy.ma as ma
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
# Mask values outside range
masked = ma.masked_outside(data, 3, 7)
print(masked)
# [-- -- 3 4 5 6 7 -- -- --]
# Mask based on condition
masked = ma.masked_where(data > 5, data)
print(masked)
# [1 2 3 4 5 -- -- -- -- --]
# Mask invalid values (inf, nan)
data_with_nan = np.array([1.0, np.nan, 3.0, np.inf, 5.0])
masked = ma.masked_invalid(data_with_nan)
print(masked)
# [1.0 -- 3.0 -- 5.0]
The trick: You can fill masked values for operations.
import numpy as np
import numpy.ma as ma
data = np.array([1.5, -999, 3.2, -999, 5.1, 2.8])
masked_data = ma.masked_equal(data, -999)
# Fill masked values with mean of valid values
filled = masked_data.filled(masked_data.mean())
print(filled)
# [1.5 3.15 3.2 3.15 5.1 2.8]
Advanced Broadcasting Tricks
Broadcasting Multiple Arrays
import numpy as np
# 1D arrays of different sizes
a = np.array([1, 2, 3]) # Shape: (3,)
b = np.array([10, 20]) # Shape: (2,)
c = np.array([100, 200, 300, 400]) # Shape: (4,)
# Add dimensions to make them compatible
a_3d = a[:, np.newaxis, np.newaxis] # Shape: (3, 1, 1)
b_3d = b[np.newaxis, :, np.newaxis] # Shape: (1, 2, 1)
c_3d = c[np.newaxis, np.newaxis, :] # Shape: (1, 1, 4)
# Broadcast together
result = a_3d + b_3d + c_3d
print(result.shape) # (3, 2, 4)
This creates all combinations of a, b, and c.
Using ellipsis (...) for Broadcasting
import numpy as np
# 4D array
data = np.random.rand(10, 20, 30, 40)
# Normalize along last axis
mean = data.mean(axis=-1, keepdims=True)
std = data.std(axis=-1, keepdims=True)
normalized = (data - mean) / std
# Using ellipsis (cleaner for high dimensions)
mean = data.mean(axis=-1)[..., np.newaxis]
std = data.std(axis=-1)[..., np.newaxis]
normalized = (data - mean) / std
... means "all other dimensions". Cleaner than counting dimensions.
Performance Profiling
Finding Bottlenecks
import numpy as np
import time
def slow_function():
    data = np.random.rand(10000, 10000)
    result = np.zeros_like(data)
    for i in range(data.shape[0]):
        result[i] = data[i] ** 2
    return result

def fast_function():
    data = np.random.rand(10000, 10000)
    return data ** 2
# Profile slow version
start = time.time()
slow_result = slow_function()
slow_time = time.time() - start
# Profile fast version
start = time.time()
fast_result = fast_function()
fast_time = time.time() - start
print(f"Slow: {slow_time:.3f}s")
print(f"Fast: {fast_time:.3f}s")
print(f"Speedup: {slow_time/fast_time:.1f}x")
Using line_profiler
pip install line_profiler
import numpy as np
@profile
def process_data():
    data = np.random.rand(10000, 10000)
    mean = data.mean(axis=0)
    std = data.std(axis=0)
    normalized = (data - mean) / std
    return normalized.sum()

process_data()
Run with:
kernprof -l -v script.py
Shows line-by-line timing.
Interfacing with C (When NumPy Isn't Fast Enough)
Using Numba for JIT Compilation
import numpy as np
from numba import jit
# Pure Python (slow)
def python_loop(n):
    result = 0
    for i in range(n):
        result += i ** 2
    return result

# Numba JIT compiled (fast)
@jit(nopython=True)
def numba_loop(n):
    result = 0
    for i in range(n):
        result += i ** 2
    return result
# First call compiles, second call is fast
result = numba_loop(10000000) # Compilation happens here
result = numba_loop(10000000) # 100x faster than Python
When to use Numba:
- Loops you can't vectorize
- Complex logic that's hard to express in NumPy
- Need C-like speed without writing C
Using Cython
# file: fast_ops.pyx
import numpy as np
cimport numpy as np
def sum_array(np.ndarray[np.float64_t, ndim=1] arr):
    cdef int i
    cdef double total = 0.0
    cdef int n = arr.shape[0]
    for i in range(n):
        total += arr[i]
    return total
Compile with:
cython fast_ops.pyx
gcc -shared -pthread -fPIC -fwrapv -O2 -Wall -fno-strict-aliasing \
-I/usr/include/python3.9 -o fast_ops.so fast_ops.c
10-100x faster than pure Python loops.
Real-World Optimization Example
Let's optimize a real data processing pipeline.
Task: Calculate rolling correlation between two time series.
Naive approach:
import numpy as np
def rolling_correlation_slow(x, y, window):
    n = len(x)
    result = np.zeros(n - window + 1)
    for i in range(len(result)):
        window_x = x[i:i+window]
        window_y = y[i:i+window]
        result[i] = np.corrcoef(window_x, window_y)[0, 1]
    return result
# Test with 100,000 points, window of 100
x = np.random.rand(100000)
y = np.random.rand(100000)
# Takes 5+ seconds
corr = rolling_correlation_slow(x, y, 100)
Optimized approach:
import numpy as np
from numpy.lib.stride_tricks import as_strided
def rolling_correlation_fast(x, y, window):
    # Create sliding windows with strides (no copy)
    n = len(x) - window + 1
    x_windows = as_strided(
        x, shape=(n, window),
        strides=(x.itemsize, x.itemsize)
    )
    y_windows = as_strided(
        y, shape=(n, window),
        strides=(y.itemsize, y.itemsize)
    )
    # Vectorized correlation calculation
    x_mean = x_windows.mean(axis=1, keepdims=True)
    y_mean = y_windows.mean(axis=1, keepdims=True)
    x_centered = x_windows - x_mean
    y_centered = y_windows - y_mean
    numerator = (x_centered * y_centered).sum(axis=1)
    denominator = np.sqrt(
        (x_centered ** 2).sum(axis=1) *
        (y_centered ** 2).sum(axis=1)
    )
    return numerator / denominator
# Takes 0.1 seconds (50x faster)
corr = rolling_correlation_fast(x, y, 100)
Key optimizations:
- Stride tricks instead of slicing (no copies)
- Vectorized operations instead of loops
- Broadcasting for mean subtraction
- Single pass through data
NumPy Gotchas That Break Production Code
Gotcha 1: Integer Division Behavior
import numpy as np
a = np.array([5, 7, 9])
b = np.array([2, 3, 4])
result = a / b
print(result)
# [2.5 2.33 2.25] # Float division (Python 3 behavior)
# But with integer arrays
a_int = np.array([5, 7, 9], dtype=int)
b_int = np.array([2, 3, 4], dtype=int)
result_int = a_int / b_int
print(result_int)
# [2.5 2.33 2.25] # Still float! / is always true division in NumPy
# Use // for integer division
result_floor = a_int // b_int
print(result_floor)
# [2 2 2]
The fix: Always be explicit about division types.
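In practice, being explicit means choosing the cast or the ufunc up front instead of relying on promotion rules. A small sketch:

```python
import numpy as np

a = np.array([5, 7, 9])
b = np.array([2, 3, 4])

# State the semantics you actually want
true_div = a.astype(np.float64) / b  # definitely float results
floor_div = np.floor_divide(a, b)    # definitely integers, same as a // b
print(floor_div)  # [2 2 2]
```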
Gotcha 2: Array Equality
import numpy as np
a = np.array([1, 2, 3])
b = np.array([1, 2, 3])
# This is NOT a single boolean -- it's an element-wise array
if a == b:  # Raises ValueError for arrays with more than one element
    print("Equal")
# Correct ways
if np.array_equal(a, b):
    print("Equal")
if (a == b).all():
    print("Equal")
Gotcha 3: Modifying Arrays During Iteration
import numpy as np
arr = np.array([1, 2, 3, 4, 5])
# Works, but slow: a Python-level loop over every element
for i in range(len(arr)):
    if arr[i] > 2:
        arr[i] = 0
# Safe (vectorized)
arr[arr > 2] = 0
Gotcha 4: Floating Point Precision
import numpy as np
# Looks like it should equal 0.3
result = 0.1 + 0.1 + 0.1
print(result)
# 0.30000000000000004
# Use np.isclose for comparisons
print(np.isclose(result, 0.3))
# True
# Or allclose for arrays
a = np.array([0.1, 0.2, 0.3])
b = a + 1e-10
print(np.allclose(a, b))
# True
Summary
Advanced NumPy is about:
- Understanding memory layout (C vs F order matters)
- Manipulating strides for zero-copy views
- Using einsum for complex tensor operations
- Avoiding temporary arrays with in-place operations
- Profiling to find real bottlenecks
Performance tricks that matter:
- Memory-mapped arrays for out-of-core processing
- Stride tricks for sliding windows
- Integer array indexing for complex filtering
- np.select for vectorizing conditional logic
- The out parameter to reuse arrays
When to use what:
- Structured arrays for simple tabular data
- Masked arrays for handling missing values
- einsum for complex linear algebra
- Numba when loops can't be avoided
- Memory mapping when data exceeds RAM
The golden rules:
- Profile before optimizing
- Vectorize when possible
- Use views instead of copies
- Be explicit about dtypes
- Test with realistic data sizes
You now have the tools to write NumPy code that runs 10-100x faster than naive approaches. Use them wisely.
Resources:
- NumPy internals: https://numpy.org/doc/stable/reference/internals.html
- Advanced indexing: https://numpy.org/doc/stable/user/basics.indexing.html
- Stride tricks: https://numpy.org/doc/stable/reference/generated/numpy.lib.stride_tricks.as_strided.html
- Numba documentation: https://numba.pydata.org/