You've been using NumPy for a while now. You understand broadcasting. You avoid loops. Your code works.
But then you run it on real data. A million rows. Ten million. Your script that worked fine on 1,000 samples suddenly takes 10 minutes. Or worse, it crashes with a memory error.
You start profiling. Turns out, that one innocent-looking line is eating 90% of your runtime. The operation you thought was O(n) is actually O(n²). Your "vectorized" code is still allocating gigabytes of temporary arrays you didn't know existed.
Here's the truth. Knowing the basics of NumPy gets you 80% of the way there. The last 20% is where the real performance lives. It's in the details nobody talks about. The memory layout tricks. The stride manipulation. The C-order vs Fortran-order gotchas. The obscure functions that solve problems in one line instead of fifty.
Let me show you the advanced stuff.
Memory Layout: Why Your Fast Code Is Actually Slow
NumPy arrays are stored in contiguous memory blocks. But there are two ways to arrange that memory.
Row-Major (C) vs Column-Major (Fortran)
import numpy as np
# C-contiguous (row-major, default)
c_array = np.arange(12).reshape(3, 4)
print(c_array)
# [[ 0 1 2 3]
# [ 4 5 6 7]
# [ 8 9 10 11]]
print(c_array.flags)
# C_CONTIGUOUS : True
# F_CONTIGUOUS : False
# Memory layout in C-order: [0,1,2,3,4,5,6,7,8,9,10,11]
# Rows are stored together
import numpy as np
# Fortran-contiguous (column-major)
f_array = np.asfortranarray(c_array)
print(f_array.flags)
# C_CONTIGUOUS : False
# F_CONTIGUOUS : True
# Memory layout in F-order: [0,4,8,1,5,9,2,6,10,3,7,11]
# Columns are stored together
Why this matters:
import numpy as np
import time
# Create large arrays
size = 10000
c_array = np.arange(size * size).reshape(size, size)
f_array = np.asfortranarray(c_array)
# Row operations on C-array (fast)
start = time.time()
for i in range(size):
    row_sum = c_array[i, :].sum()
c_time = time.time() - start
# Row operations on F-array (slow)
start = time.time()
for i in range(size):
    row_sum = f_array[i, :].sum()
f_time = time.time() - start
print(f"C-array row ops: {c_time:.3f}s")
print(f"F-array row ops: {f_time:.3f}s")
print(f"F-array is {f_time/c_time:.1f}x slower for rows")
# Column operations on F-array (fast)
start = time.time()
for i in range(size):
    col_sum = f_array[:, i].sum()
f_time_col = time.time() - start
# Column operations on C-array (slow)
start = time.time()
for i in range(size):
    col_sum = c_array[:, i].sum()
c_time_col = time.time() - start
print(f"\nF-array column ops: {f_time_col:.3f}s")
print(f"C-array column ops: {c_time_col:.3f}s")
print(f"C-array is {c_time_col/f_time_col:.1f}x slower for columns")
On my machine, F-array row operations are 3x slower. C-array column operations are 3x slower.
The trick:
If you're doing heavy column operations (common in linear algebra, some machine learning), convert to Fortran order first.
import numpy as np
data = np.random.rand(10000, 10000)
# Converting to F-order for column operations
data_f = np.asfortranarray(data)
# Now column operations are 3x faster
column_means = data_f.mean(axis=0)
Strides: The Hidden Performance Lever
Every NumPy array has strides. These tell NumPy how many bytes to skip to move to the next element in each dimension.
import numpy as np
arr = np.arange(12).reshape(3, 4)
print(arr)
# [[ 0 1 2 3]
# [ 4 5 6 7]
# [ 8 9 10 11]]
print(arr.strides)
# (32, 8)
What does (32, 8) mean?
- To move to the next row, skip 32 bytes (4 elements × 8 bytes per int64)
- To move to the next column, skip 8 bytes (1 element)
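Strides also explain why transposing or slicing with a step never copies data: NumPy just records different strides over the same buffer. A quick sketch to confirm this (the variable names are illustrative):

```python
import numpy as np

arr = np.arange(12).reshape(3, 4)

# Transposing swaps the strides -- no data moves
t = arr.T
print(t.strides == arr.strides[::-1])  # True
print(np.shares_memory(t, arr))        # True: same underlying buffer

# Slicing with a step just scales a stride
every_other = arr[:, ::2]
print(every_other.strides[1] == 2 * arr.strides[1])  # True
```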
The trick: You can manipulate strides to create views without copying data.
Creating a Sliding Window View
import numpy as np
from numpy.lib.stride_tricks import as_strided
data = np.arange(10)
print(data)
# [0 1 2 3 4 5 6 7 8 9]
# Create sliding window of size 3
window_size = 3
n_windows = len(data) - window_size + 1
# Manually set strides to create windows
windows = as_strided(
    data,
    shape=(n_windows, window_size),
    strides=(data.itemsize, data.itemsize)
)
print(windows)
# [[0 1 2]
# [1 2 3]
# [2 3 4]
# [3 4 5]
# [4 5 6]
# [5 6 7]
# [6 7 8]
# [7 8 9]]
No data was copied. All windows are views into the original array. This is 100x faster than creating windows with a loop.
Real-world use case: Rolling averages
import numpy as np
from numpy.lib.stride_tricks import as_strided
data = np.random.rand(1000000)
window_size = 100
# Create sliding windows
n_windows = len(data) - window_size + 1
windows = as_strided(
    data,
    shape=(n_windows, window_size),
    strides=(data.itemsize, data.itemsize)
)
# Calculate rolling average (mean of each window)
rolling_avg = windows.mean(axis=1)
print(f"Original size: {len(data)}")
print(f"Rolling averages: {len(rolling_avg)}")
This computes 999,901 averages instantly. No loops. No copies.
Warning: as_strided is dangerous. Invalid strides can read garbage memory or crash Python. Always validate your stride calculations.
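If you're on NumPy 1.20 or newer, `sliding_window_view` does the same stride arithmetic with bounds checking built in, so it's the safer default (this sketch assumes that NumPy version):

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

data = np.arange(10)

# Same windows as the as_strided example, but NumPy validates the
# shape and strides for you and returns a read-only view
windows = sliding_window_view(data, window_shape=3)
print(windows.shape)  # (8, 3)

rolling_avg = windows.mean(axis=1)
```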
Advanced Indexing: Beyond the Basics
Integer Array Indexing (The Confusing One)
import numpy as np
data = np.arange(20).reshape(4, 5)
print(data)
# [[ 0 1 2 3 4]
# [ 5 6 7 8 9]
# [10 11 12 13 14]
# [15 16 17 18 19]]
# Grab specific elements by (row, col) pairs
rows = np.array([0, 1, 2, 3])
cols = np.array([0, 1, 2, 3])
result = data[rows, cols]
print(result)
# [ 0 6 12 18] # Diagonal elements
This grabs data[0,0], data[1,1], data[2,2], data[3,3].
The trick: You can use this for advanced filtering.
import numpy as np
data = np.random.rand(1000, 100)
# Find row and column indices where value > 0.99
rows, cols = np.where(data > 0.99)
# Get those specific values
high_values = data[rows, cols]
print(f"Found {len(high_values)} values > 0.99")
np.ix_: Grid Indexing Made Easy
import numpy as np
data = np.arange(100).reshape(10, 10)
# Want rows [1, 3, 5] and columns [2, 4, 6]
# Without np.ix_, this is painful
# With np.ix_:
rows = np.array([1, 3, 5])
cols = np.array([2, 4, 6])
subgrid = data[np.ix_(rows, cols)]
print(subgrid)
# [[12 14 16]
# [32 34 36]
# [52 54 56]]
np.ix_ creates the right indexing arrays for you. No manual broadcasting needed.
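Under the hood, np.ix_ just reshapes the index arrays so they broadcast against each other. Doing it by hand shows exactly what it saves you:

```python
import numpy as np

data = np.arange(100).reshape(10, 10)
rows = np.array([1, 3, 5])
cols = np.array([2, 4, 6])

# Equivalent to data[np.ix_(rows, cols)]: a (3, 1) column of row
# indices broadcast against a (1, 3) row of column indices
subgrid = data[rows[:, np.newaxis], cols[np.newaxis, :]]
print(subgrid.shape)  # (3, 3)
```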
Vectorization Patterns Nobody Teaches
Pattern 1: Vectorizing Distance Calculations
Problem: Calculate pairwise distances between all points.
Naive approach (loops):
import numpy as np
points = np.random.rand(1000, 2) # 1000 points in 2D
# Slow (nested loops)
distances = np.zeros((1000, 1000))
for i in range(1000):
    for j in range(1000):
        distances[i, j] = np.sqrt(np.sum((points[i] - points[j])**2))
# Takes seconds
Vectorized approach:
import numpy as np
points = np.random.rand(1000, 2)
# Expand dimensions for broadcasting
p1 = points[:, np.newaxis, :] # Shape: (1000, 1, 2)
p2 = points[np.newaxis, :, :] # Shape: (1, 1000, 2)
# Broadcast subtraction
diff = p1 - p2 # Shape: (1000, 1000, 2)
# Calculate distances
distances = np.sqrt((diff**2).sum(axis=2))
# Takes milliseconds
10-100x faster. No loops.
Even faster with einsum:
import numpy as np
points = np.random.rand(1000, 2)
# Using einsum for dot products
sq_distances = (
    np.einsum('ij,ij->i', points, points)[:, np.newaxis] +
    np.einsum('ij,ij->i', points, points)[np.newaxis, :] -
    2 * np.einsum('ik,jk->ij', points, points)
)
# Clamp tiny negatives from floating-point round-off before the sqrt
distances = np.sqrt(np.maximum(sq_distances, 0))
This is the fastest pure NumPy approach.
Pattern 2: Vectorizing Conditional Logic
Problem: Apply different formulas based on conditions.
import numpy as np
x = np.random.rand(1000000) * 10
# Slow (loop with if-else)
result = np.zeros(len(x))
for i in range(len(x)):
    if x[i] < 3:
        result[i] = x[i] ** 2
    elif x[i] < 7:
        result[i] = x[i] * 10
    else:
        result[i] = x[i] / 2
Vectorized with np.select:
import numpy as np
x = np.random.rand(1000000) * 10
# Define conditions
conditions = [
    x < 3,
    (x >= 3) & (x < 7),
    x >= 7
]
# Define corresponding values
choices = [
    x ** 2,
    x * 10,
    x / 2
]
# Apply all at once
result = np.select(conditions, choices)
100x faster. One pass through the data instead of a million if-else checks.
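When there are only two branches, np.where is the simpler sibling of np.select. A minimal sketch:

```python
import numpy as np

x = np.random.rand(1_000_000) * 10

# Single condition: square values below 5, halve the rest
result = np.where(x < 5, x ** 2, x / 2)
```

Note that both branch expressions are evaluated for every element before selection, so avoid expressions that error on the "wrong" branch (e.g. np.sqrt of negatives) without masking first.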
Pattern 3: Vectorizing Cumulative Operations with Different Windows
Problem: Calculate moving average with varying window sizes.
import numpy as np
data = np.random.rand(10000)
window_sizes = np.random.randint(5, 100, size=10000)
# Can't use np.convolve because the window size changes per element
# np.add.reduceat handles segment sums, but overlapping, variable
# windows are awkward to express with it
np.add.reduceat can get you part of the way, but it's built for non-overlapping segments and gets ugly here. Better approach: use Numba for this case or accept a loop.
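For sums and means specifically, there is a fully vectorized workaround: prefix sums turn every window sum into a single subtraction. A sketch of that idea:

```python
import numpy as np

data = np.random.rand(10000)
window_sizes = np.random.randint(5, 100, size=10000)

# Prefix sums: cs[i] holds sum(data[:i]), so any window sum is cs[end] - cs[start]
cs = np.concatenate(([0.0], np.cumsum(data)))

idx = np.arange(len(data))
starts = np.maximum(idx - window_sizes + 1, 0)  # clamp early, short windows

window_sums = cs[idx + 1] - cs[starts]
window_counts = idx + 1 - starts
rolling_means = window_sums / window_counts
```

One caveat: long prefix sums accumulate floating-point error, so for very long series or wildly varying magnitudes the Numba route is still more robust.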
Memory-Efficient Operations
Problem: Operating on Arrays Larger Than RAM
You have a 50GB array. You have 16GB RAM. How do you process it?
Solution 1: Memory-Mapped Arrays
import numpy as np
# Create a memory-mapped array (stored on disk)
mmap_array = np.memmap(
    'large_data.dat',
    dtype='float64',
    mode='w+',
    shape=(100000000, 100)  # 80GB of data
)
# Write data in chunks
chunk_size = 1000000
for i in range(0, len(mmap_array), chunk_size):
    end = min(i + chunk_size, len(mmap_array))
    mmap_array[i:end] = np.random.rand(end - i, 100)
mmap_array.flush()  # Write to disk
# Process in chunks
chunk_sums = []
for i in range(0, len(mmap_array), chunk_size):
    end = min(i + chunk_size, len(mmap_array))
    chunk = mmap_array[i:end]
    chunk_sums.append(chunk.sum())
total_sum = np.sum(chunk_sums)
The array lives on disk. NumPy loads chunks into RAM as needed.
Solution 2: Out-of-Core Processing with Views
import numpy as np
# Create large file
data = np.memmap('data.dat', dtype='float32', mode='w+', shape=(1000000000,))
# Process with views (no extra memory)
for i in range(0, len(data), 10000000):
    chunk = data[i:i+10000000]
    chunk *= 2  # Process in place
# Flush changes and release the memory map
del data
In-Place Operations: Saving Memory
import numpy as np
# Creates a new array (doubles memory usage)
data = np.random.rand(100000000)
result = data * 2
# In-place (no extra memory)
data = np.random.rand(100000000)
data *= 2
All in-place operators:
- +=, -=, *=, /=, //=, **=, %=
- np.add(a, b, out=a), np.multiply(a, b, out=a), etc.
The trick: Use the out parameter in ufuncs.
import numpy as np
a = np.random.rand(10000000)
b = np.random.rand(10000000)
# Creates new array
result = np.add(a, b)
# Reuses existing array
result = np.empty(len(a))
np.add(a, b, out=result)
# Or even better, reuse one of the inputs
np.add(a, b, out=a) # Result stored in a
Einstein Summation (einsum): The Swiss Army Knife
np.einsum is incredibly powerful but confusing. Once you understand it, it replaces dozens of reshapes and transposes.
The Notation
einsum uses index notation. Letters represent dimensions.
import numpy as np
# Matrix multiplication
A = np.random.rand(3, 4)
B = np.random.rand(4, 5)
# Traditional way
result = A @ B
# With einsum
result = np.einsum('ij,jk->ik', A, B)
'ij,jk->ik' means:
- First array has dimensions i,j
- Second array has dimensions j,k
- Sum over j (appears on both sides)
- Result has dimensions i,k
Common Patterns
Trace (diagonal sum):
import numpy as np
matrix = np.random.rand(100, 100)
# Traditional
trace = np.trace(matrix)
# Einsum
trace = np.einsum('ii->', matrix)
Batch matrix multiplication:
import numpy as np
# 10 matrices of size 3x4
A = np.random.rand(10, 3, 4)
# 10 matrices of size 4x5
B = np.random.rand(10, 4, 5)
# Multiply corresponding pairs
result = np.einsum('nij,njk->nik', A, B)
# Result shape: (10, 3, 5)
Element-wise multiplication then sum:
import numpy as np
A = np.random.rand(1000, 1000)
B = np.random.rand(1000, 1000)
# Traditional
result = (A * B).sum()
# Einsum (faster)
result = np.einsum('ij,ij->', A, B)
Outer product of all vectors:
import numpy as np
vectors = np.random.rand(100, 3)
# Outer product of each vector with itself
outer_products = np.einsum('ni,nj->nij', vectors, vectors)
# Shape: (100, 3, 3)
When einsum is Faster
- Operations that combine reshape + transpose + multiply
- Batch operations
- Complex tensor contractions
The trick: Use optimize=True for complex expressions.
import numpy as np
A = np.random.rand(50, 100)
B = np.random.rand(100, 75)
C = np.random.rand(75, 50)
# Without optimization
result = np.einsum('ij,jk,kl->il', A, B, C)
# With optimization (finds best evaluation order)
result = np.einsum('ij,jk,kl->il', A, B, C, optimize=True)
The optimized version can be 10x faster for complex expressions.
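You can inspect (and reuse) the plan that optimize=True finds with np.einsum_path:

```python
import numpy as np

A = np.random.rand(50, 100)
B = np.random.rand(100, 75)
C = np.random.rand(75, 50)

# Returns the chosen contraction order plus a human-readable cost report
path, report = np.einsum_path('ij,jk,kl->il', A, B, C, optimize='greedy')
print(report)

# Pass the precomputed path back in to skip re-planning on repeated calls
result = np.einsum('ij,jk,kl->il', A, B, C, optimize=path)
```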
Structured Arrays: DataFrames Without Pandas
Structured arrays let you store different data types in one array.
import numpy as np
# Define structure
dtype = [
    ('name', 'U20'),   # 20-character Unicode string
    ('age', 'i4'),     # 32-bit integer
    ('salary', 'f8')   # 64-bit float
]
# Create structured array
employees = np.array([
    ('Alice', 30, 75000.0),
    ('Bob', 25, 65000.0),
    ('Charlie', 35, 85000.0)
], dtype=dtype)
print(employees)
# [('Alice', 30, 75000.) ('Bob', 25, 65000.) ('Charlie', 35, 85000.)]
# Access by field name
print(employees['name'])
# ['Alice' 'Bob' 'Charlie']
# Filter by condition
high_earners = employees[employees['salary'] > 70000]
print(high_earners)
# [('Alice', 30, 75000.) ('Charlie', 35, 85000.)]
# Calculate on specific field
avg_salary = employees['salary'].mean()
print(f"Average salary: ${avg_salary:,.2f}")
Why use this instead of Pandas?
- Lower memory overhead
- Faster for simple operations
- Better for interfacing with C libraries
- No external dependency
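The memory claim is easy to verify: a structured array packs each record into a fixed-size block, and itemsize/nbytes report it exactly (reusing the employees array from above):

```python
import numpy as np

dtype = [('name', 'U20'), ('age', 'i4'), ('salary', 'f8')]
employees = np.array([
    ('Alice', 30, 75000.0),
    ('Bob', 25, 65000.0),
    ('Charlie', 35, 85000.0)
], dtype=dtype)

# 'U20' stores 20 UCS-4 code points (80 bytes) + i4 (4) + f8 (8)
print(employees.dtype.itemsize)  # 92 bytes per record
print(employees.nbytes)          # 276 bytes for all 3 records
```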
The trick: Use np.sort with order parameter for multi-field sorting.
import numpy as np
dtype = [('name', 'U20'), ('age', 'i4'), ('salary', 'f8')]
employees = np.array([
    ('Alice', 30, 75000.0),
    ('Bob', 25, 85000.0),
    ('Charlie', 35, 65000.0)
], dtype=dtype)
# Sort ascending by salary, then age
sorted_employees = np.sort(employees, order=['salary', 'age'])[::-1]
# Note: [::-1] reverses everything, so this gives salary descending but
# breaks ties with age descending too. For salary descending with age
# ascending, use np.lexsort instead:
# employees[np.lexsort((employees['age'], -employees['salary']))]
print(sorted_employees)
Masked Arrays: Handling Missing Data Properly
Masked arrays let you mark certain values as invalid without removing them.
import numpy as np
import numpy.ma as ma
# Data with invalid values
data = np.array([1.5, -999, 3.2, -999, 5.1, 2.8])
# Mark -999 as invalid
masked_data = ma.masked_equal(data, -999)
print(masked_data)
# [1.5 -- 3.2 -- 5.1 2.8]
# Operations skip masked values
print(masked_data.mean())
# 3.15 (only valid values)
print(masked_data.std())
# 1.43 (only valid values)
More masking options:
import numpy as np
import numpy.ma as ma
data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])
# Mask values outside range
masked = ma.masked_outside(data, 3, 7)
print(masked)
# [-- -- 3 4 5 6 7 -- -- --]
# Mask based on condition
masked = ma.masked_where(data > 5, data)
print(masked)
# [1 2 3 4 5 -- -- -- -- --]
# Mask invalid values (inf, nan)
data_with_nan = np.array([1.0, np.nan, 3.0, np.inf, 5.0])
masked = ma.masked_invalid(data_with_nan)
print(masked)
# [1.0 -- 3.0 -- 5.0]
The trick: You can fill masked values for operations.
import numpy as np
import numpy.ma as ma
data = np.array([1.5, -999, 3.2, -999, 5.1, 2.8])
masked_data = ma.masked_equal(data, -999)
# Fill masked values with mean of valid values
filled = masked_data.filled(masked_data.mean())
print(filled)
# [1.5 3.15 3.2 3.15 5.1 2.8]
Advanced Broadcasting Tricks
Broadcasting Multiple Arrays
import numpy as np
# 1D arrays of different sizes
a = np.array([1, 2, 3]) # Shape: (3,)
b = np.array([10, 20]) # Shape: (2,)
c = np.array([100, 200, 300, 400]) # Shape: (4,)
# Add dimensions to make them compatible
a_3d = a[:, np.newaxis, np.newaxis] # Shape: (3, 1, 1)
b_3d = b[np.newaxis, :, np.newaxis] # Shape: (1, 2, 1)
c_3d = c[np.newaxis, np.newaxis, :] # Shape: (1, 1, 4)
# Broadcast together
result = a_3d + b_3d + c_3d
print(result.shape) # (3, 2, 4)
This creates all combinations of a, b, and c.
Using ellipsis (...) for Broadcasting
import numpy as np
# 4D array
data = np.random.rand(10, 20, 30, 40)
# Normalize along last axis
mean = data.mean(axis=-1, keepdims=True)
std = data.std(axis=-1, keepdims=True)
normalized = (data - mean) / std
# Using ellipsis (cleaner for high dimensions)
mean = data.mean(axis=-1)[..., np.newaxis]
std = data.std(axis=-1)[..., np.newaxis]
normalized = (data - mean) / std
... means "all other dimensions". Cleaner than counting dimensions.
Performance Profiling
Finding Bottlenecks
import numpy as np
import time
def slow_function():
    data = np.random.rand(10000, 10000)
    result = np.zeros_like(data)
    for i in range(data.shape[0]):
        result[i] = data[i] ** 2
    return result

def fast_function():
    data = np.random.rand(10000, 10000)
    return data ** 2
# Profile slow version
start = time.time()
slow_result = slow_function()
slow_time = time.time() - start
# Profile fast version
start = time.time()
fast_result = fast_function()
fast_time = time.time() - start
print(f"Slow: {slow_time:.3f}s")
print(f"Fast: {fast_time:.3f}s")
print(f"Speedup: {slow_time/fast_time:.1f}x")
Using line_profiler
pip install line_profiler
import numpy as np
@profile
def process_data():
    data = np.random.rand(10000, 10000)
    mean = data.mean(axis=0)
    std = data.std(axis=0)
    normalized = (data - mean) / std
    return normalized.sum()

process_data()
Run with:
kernprof -l -v script.py
Shows line-by-line timing.
Interfacing with C (When NumPy Isn't Fast Enough)
Using Numba for JIT Compilation
import numpy as np
from numba import jit
# Pure Python (slow)
def python_loop(n):
    result = 0
    for i in range(n):
        result += i ** 2
    return result

# Numba JIT compiled (fast)
@jit(nopython=True)
def numba_loop(n):
    result = 0
    for i in range(n):
        result += i ** 2
    return result
# First call compiles, second call is fast
result = numba_loop(10000000) # Compilation happens here
result = numba_loop(10000000) # 100x faster than Python
When to use Numba:
- Loops you can't vectorize
- Complex logic that's hard to express in NumPy
- Need C-like speed without writing C
Using Cython
# file: fast_ops.pyx
import numpy as np
cimport numpy as np
def sum_array(np.ndarray[np.float64_t, ndim=1] arr):
    cdef int i
    cdef double total = 0.0
    cdef int n = arr.shape[0]
    for i in range(n):
        total += arr[i]
    return total
Compile with:
cython fast_ops.pyx
gcc -shared -pthread -fPIC -fwrapv -O2 -Wall -fno-strict-aliasing \
-I/usr/include/python3.9 -o fast_ops.so fast_ops.c
10-100x faster than pure Python loops.
Real-World Optimization Example
Let's optimize a real data processing pipeline.
Task: Calculate rolling correlation between two time series.
Naive approach:
import numpy as np
def rolling_correlation_slow(x, y, window):
    n = len(x)
    result = np.zeros(n - window + 1)
    for i in range(len(result)):
        window_x = x[i:i+window]
        window_y = y[i:i+window]
        result[i] = np.corrcoef(window_x, window_y)[0, 1]
    return result
# Test with 100,000 points, window of 100
x = np.random.rand(100000)
y = np.random.rand(100000)
# Takes 5+ seconds
corr = rolling_correlation_slow(x, y, 100)
Optimized approach:
import numpy as np
from numpy.lib.stride_tricks import as_strided
def rolling_correlation_fast(x, y, window):
    # Create sliding windows with strides (no copy)
    n = len(x) - window + 1
    x_windows = as_strided(
        x, shape=(n, window),
        strides=(x.itemsize, x.itemsize)
    )
    y_windows = as_strided(
        y, shape=(n, window),
        strides=(y.itemsize, y.itemsize)
    )
    # Vectorized correlation calculation
    x_mean = x_windows.mean(axis=1, keepdims=True)
    y_mean = y_windows.mean(axis=1, keepdims=True)
    x_centered = x_windows - x_mean
    y_centered = y_windows - y_mean
    numerator = (x_centered * y_centered).sum(axis=1)
    denominator = np.sqrt(
        (x_centered ** 2).sum(axis=1) *
        (y_centered ** 2).sum(axis=1)
    )
    return numerator / denominator
# Takes 0.1 seconds (50x faster)
corr = rolling_correlation_fast(x, y, 100)
Key optimizations:
- Stride tricks instead of slicing (no copies)
- Vectorized operations instead of loops
- Broadcasting for mean subtraction
- Single pass through data
NumPy Gotchas That Break Production Code
Gotcha 1: Integer Division Behavior
import numpy as np
a = np.array([5, 7, 9])
b = np.array([2, 3, 4])
result = a / b
print(result)
# [2.5 2.33 2.25] # Float division (Python 3 behavior)
# But with integer arrays
a_int = np.array([5, 7, 9], dtype=int)
b_int = np.array([2, 3, 4], dtype=int)
result_int = a_int / b_int
print(result_int)
# [2.5 2.33 2.25] # Still float! / is always true division in NumPy
# Use // for integer division
result_floor = a_int // b_int
print(result_floor)
# [2 2 2]
The fix: Always be explicit about division types.
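In practice, being explicit means choosing the cast or the ufunc up front instead of relying on promotion rules. A small sketch:

```python
import numpy as np

a = np.array([5, 7, 9])
b = np.array([2, 3, 4])

# State the semantics you actually want
true_div = a.astype(np.float64) / b  # definitely float results
floor_div = np.floor_divide(a, b)    # definitely integers, same as a // b
print(floor_div)  # [2 2 2]
```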
Gotcha 2: Array Equality
import numpy as np
a = np.array([1, 2, 3])
b = np.array([1, 2, 3])
# This is NOT a single boolean -- it's an element-wise array
if a == b:  # Raises ValueError for arrays with more than one element
    print("Equal")
# Correct ways
if np.array_equal(a, b):
    print("Equal")
if (a == b).all():
    print("Equal")
Gotcha 3: Modifying Arrays During Iteration
import numpy as np
arr = np.array([1, 2, 3, 4, 5])
# Works, but slow: a Python-level loop over every element
for i in range(len(arr)):
    if arr[i] > 2:
        arr[i] = 0
# Safe (vectorized)
arr[arr > 2] = 0
Gotcha 4: Floating Point Precision
import numpy as np
# Looks like it should equal 0.3
result = 0.1 + 0.1 + 0.1
print(result)
# 0.30000000000000004
# Use np.isclose for comparisons
print(np.isclose(result, 0.3))
# True
# Or allclose for arrays
a = np.array([0.1, 0.2, 0.3])
b = a + 1e-10
print(np.allclose(a, b))
# True
Summary
Advanced NumPy is about:
- Understanding memory layout (C vs F order matters)
- Manipulating strides for zero-copy views
- Using einsum for complex tensor operations
- Avoiding temporary arrays with in-place operations
- Profiling to find real bottlenecks
Performance tricks that matter:
- Memory-mapped arrays for out-of-core processing
- Stride tricks for sliding windows
- Integer array indexing for complex filtering
- np.select for vectorizing conditional logic
- The out parameter to reuse arrays
When to use what:
- Structured arrays for simple tabular data
- Masked arrays for handling missing values
- einsum for complex linear algebra
- Numba when loops can't be avoided
- Memory mapping when data exceeds RAM
The golden rules:
- Profile before optimizing
- Vectorize when possible
- Use views instead of copies
- Be explicit about dtypes
- Test with realistic data sizes
You now have the tools to write NumPy code that runs 10-100x faster than naive approaches. Use them wisely.
Resources:
- NumPy internals: https://numpy.org/doc/stable/reference/internals.html
- Advanced indexing: https://numpy.org/doc/stable/user/basics.indexing.html
- Stride tricks: https://numpy.org/doc/stable/reference/generated/numpy.lib.stride_tricks.as_strided.html
- Numba documentation: https://numba.pydata.org/