Muhammad Ikramullah Khan

Advanced NumPy: The Performance Tricks That Separate Pros From Beginners

You've been using NumPy for a while now. You understand broadcasting. You avoid loops. Your code works.

But then you run it on real data. A million rows. Ten million. Your script that worked fine on 1,000 samples suddenly takes 10 minutes. Or worse, it crashes with a memory error.

You start profiling. Turns out, that one innocent-looking line is eating 90% of your runtime. The operation you thought was O(n) is actually O(n²). Your "vectorized" code is still allocating gigabytes of temporary arrays you didn't know existed.

Here's the truth. Knowing the basics of NumPy gets you 80% of the way there. The last 20% is where the real performance lives. It's in the details nobody talks about. The memory layout tricks. The stride manipulation. The C-order vs Fortran-order gotchas. The obscure functions that solve problems in one line instead of fifty.

Let me show you the advanced stuff.


Memory Layout: Why Your Fast Code Is Actually Slow

NumPy arrays are stored in contiguous memory blocks. But there are two ways to arrange that memory.

Row-Major (C) vs Column-Major (Fortran)

import numpy as np

# C-contiguous (row-major, default)
c_array = np.arange(12).reshape(3, 4)
print(c_array)
# [[ 0  1  2  3]
#  [ 4  5  6  7]
#  [ 8  9 10 11]]

print(c_array.flags)
# C_CONTIGUOUS : True
# F_CONTIGUOUS : False

# Memory layout in C-order: [0,1,2,3,4,5,6,7,8,9,10,11]
# Rows are stored together
import numpy as np

# Fortran-contiguous (column-major)
f_array = np.asfortranarray(c_array)

print(f_array.flags)
# C_CONTIGUOUS : False
# F_CONTIGUOUS : True

# Memory layout in F-order: [0,4,8,1,5,9,2,6,10,3,7,11]
# Columns are stored together

Why this matters:

import numpy as np
import time

# Create large arrays
size = 10000
c_array = np.arange(size * size).reshape(size, size)
f_array = np.asfortranarray(c_array)

# Row operations on C-array (fast)
start = time.time()
for i in range(size):
    row_sum = c_array[i, :].sum()
c_time = time.time() - start

# Row operations on F-array (slow)
start = time.time()
for i in range(size):
    row_sum = f_array[i, :].sum()
f_time = time.time() - start

print(f"C-array row ops: {c_time:.3f}s")
print(f"F-array row ops: {f_time:.3f}s")
print(f"F-array is {f_time/c_time:.1f}x slower for rows")

# Column operations on F-array (fast)
start = time.time()
for i in range(size):
    col_sum = f_array[:, i].sum()
f_time_col = time.time() - start

# Column operations on C-array (slow)
start = time.time()
for i in range(size):
    col_sum = c_array[:, i].sum()
c_time_col = time.time() - start

print(f"\nF-array column ops: {f_time_col:.3f}s")
print(f"C-array column ops: {c_time_col:.3f}s")
print(f"C-array is {c_time_col/f_time_col:.1f}x slower for columns")

On my machine, F-array row operations are 3x slower. C-array column operations are 3x slower.

The trick:

If you're doing heavy column operations (common in linear algebra, some machine learning), convert to Fortran order first.

import numpy as np

data = np.random.rand(10000, 10000)

# Converting to F-order for column operations
data_f = np.asfortranarray(data)

# Now column operations are 3x faster
column_means = data_f.mean(axis=0)

Strides: The Hidden Performance Lever

Every NumPy array has strides. These tell NumPy how many bytes to skip to move to the next element in each dimension.

import numpy as np

arr = np.arange(12).reshape(3, 4)
print(arr)
# [[ 0  1  2  3]
#  [ 4  5  6  7]
#  [ 8  9 10 11]]

print(arr.strides)
# (32, 8)

What does (32, 8) mean?

  • To move to the next row, skip 32 bytes (4 elements × 8 bytes per int64)
  • To move to the next column, skip 8 bytes (1 element)

The trick: You can manipulate strides to create views without copying data.

Creating a Sliding Window View

import numpy as np
from numpy.lib.stride_tricks import as_strided

data = np.arange(10)
print(data)
# [0 1 2 3 4 5 6 7 8 9]

# Create sliding window of size 3
window_size = 3
n_windows = len(data) - window_size + 1

# Manually set strides to create windows
windows = as_strided(
    data,
    shape=(n_windows, window_size),
    strides=(data.itemsize, data.itemsize)
)

print(windows)
# [[0 1 2]
#  [1 2 3]
#  [2 3 4]
#  [3 4 5]
#  [4 5 6]
#  [5 6 7]
#  [6 7 8]
#  [7 8 9]]

No data was copied. All windows are views into the original array. This is 100x faster than creating windows with a loop.

Real-world use case: Rolling averages

import numpy as np
from numpy.lib.stride_tricks import as_strided

data = np.random.rand(1000000)
window_size = 100

# Create sliding windows
n_windows = len(data) - window_size + 1
windows = as_strided(
    data,
    shape=(n_windows, window_size),
    strides=(data.itemsize, data.itemsize)
)

# Calculate rolling average (mean of each window)
rolling_avg = windows.mean(axis=1)

print(f"Original size: {len(data)}")
print(f"Rolling averages: {len(rolling_avg)}")

This computes 999,901 averages instantly. No loops. No copies.

Warning: as_strided is dangerous. Invalid strides can read garbage memory or crash Python. Always validate your stride calculations.
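Since NumPy 1.20 there is a safer wrapper, sliding_window_view, which builds the same zero-copy windows but validates the shape and stride arithmetic for you:

```python
import numpy as np
from numpy.lib.stride_tricks import sliding_window_view

data = np.arange(10)

# Zero-copy sliding windows, with the stride math handled by NumPy
windows = sliding_window_view(data, window_shape=3)
print(windows.shape)    # (8, 3)

rolling_avg = windows.mean(axis=1)
print(rolling_avg[:3])  # [1. 2. 3.]
```

Prefer it over raw as_strided whenever your NumPy version has it.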


Advanced Indexing: Beyond the Basics

Integer Array Indexing (The Confusing One)

import numpy as np

data = np.arange(20).reshape(4, 5)
print(data)
# [[ 0  1  2  3  4]
#  [ 5  6  7  8  9]
#  [10 11 12 13 14]
#  [15 16 17 18 19]]

# Grab specific elements by (row, col) pairs
rows = np.array([0, 1, 2, 3])
cols = np.array([0, 1, 2, 3])

result = data[rows, cols]
print(result)
# [ 0  6 12 18]  # Diagonal elements

This grabs data[0,0], data[1,1], data[2,2], data[3,3].

The trick: You can use this for advanced filtering.

import numpy as np

data = np.random.rand(1000, 100)

# Find row and column indices where value > 0.99
rows, cols = np.where(data > 0.99)

# Get those specific values
high_values = data[rows, cols]

print(f"Found {len(high_values)} values > 0.99")

np.ix_: Grid Indexing Made Easy

import numpy as np

data = np.arange(100).reshape(10, 10)

# Want rows [1, 3, 5] and columns [2, 4, 6]
# Without np.ix_, this is painful

# With np.ix_:
rows = np.array([1, 3, 5])
cols = np.array([2, 4, 6])

subgrid = data[np.ix_(rows, cols)]
print(subgrid)
# [[12 14 16]
#  [32 34 36]
#  [52 54 56]]

np.ix_ creates the right indexing arrays for you. No manual broadcasting needed.
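For reference, the manual equivalent broadcasts a column of row indices against the column indices, which is exactly what np.ix_ constructs for you:

```python
import numpy as np

data = np.arange(100).reshape(10, 10)
rows = np.array([1, 3, 5])
cols = np.array([2, 4, 6])

# rows[:, None] has shape (3, 1); broadcasting against cols (3,) gives (3, 3)
subgrid = data[rows[:, None], cols]
print(subgrid)
# [[12 14 16]
#  [32 34 36]
#  [52 54 56]]
```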


Vectorization Patterns Nobody Teaches

Pattern 1: Vectorizing Distance Calculations

Problem: Calculate pairwise distances between all points.

Naive approach (loops):

import numpy as np

points = np.random.rand(1000, 2)  # 1000 points in 2D

# Slow (nested loops)
distances = np.zeros((1000, 1000))
for i in range(1000):
    for j in range(1000):
        distances[i, j] = np.sqrt(np.sum((points[i] - points[j])**2))

# Takes seconds

Vectorized approach:

import numpy as np

points = np.random.rand(1000, 2)

# Expand dimensions for broadcasting
p1 = points[:, np.newaxis, :]  # Shape: (1000, 1, 2)
p2 = points[np.newaxis, :, :]  # Shape: (1, 1000, 2)

# Broadcast subtraction
diff = p1 - p2  # Shape: (1000, 1000, 2)

# Calculate distances
distances = np.sqrt((diff**2).sum(axis=2))

# Takes milliseconds

10-100x faster. No loops.

Even faster with einsum:

import numpy as np

points = np.random.rand(1000, 2)

# Squared-norm expansion: ||a - b||^2 = ||a||^2 + ||b||^2 - 2 a.b
sq_norms = np.einsum('ij,ij->i', points, points)
sq_distances = (
    sq_norms[:, np.newaxis] +
    sq_norms[np.newaxis, :] -
    2 * np.einsum('ik,jk->ij', points, points)
)

# Clamp tiny negatives from floating-point cancellation before the sqrt
distances = np.sqrt(np.maximum(sq_distances, 0))

This is the fastest pure NumPy approach.
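Outside pure NumPy, if SciPy is available, scipy.spatial.distance.cdist computes the same matrix in C without materializing the (n, n, 2) intermediate:

```python
import numpy as np
from scipy.spatial.distance import cdist

points = np.random.rand(1000, 2)

# Pairwise Euclidean distances, computed in compiled code
distances = cdist(points, points)
print(distances.shape)  # (1000, 1000)
```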

Pattern 2: Vectorizing Conditional Logic

Problem: Apply different formulas based on conditions.

import numpy as np

x = np.random.rand(1000000) * 10

# Slow (loop with if-else)
result = np.zeros(len(x))
for i in range(len(x)):
    if x[i] < 3:
        result[i] = x[i] ** 2
    elif x[i] < 7:
        result[i] = x[i] * 10
    else:
        result[i] = x[i] / 2

Vectorized with np.select:

import numpy as np

x = np.random.rand(1000000) * 10

# Define conditions
conditions = [
    x < 3,
    (x >= 3) & (x < 7),
    x >= 7
]

# Define corresponding functions
choices = [
    x ** 2,
    x * 10,
    x / 2
]

# Apply all at once
result = np.select(conditions, choices)

Roughly 100x faster. A few vectorized passes over the data instead of a million Python-level if-else checks. (Note that np.select evaluates every choice array in full, so each one must be safe to compute on the whole input.)

Pattern 3: Vectorizing Cumulative Operations with Different Windows

Problem: Calculate moving average with varying window sizes.

import numpy as np

data = np.random.rand(10000)
window_sizes = np.random.randint(5, 100, size=10000)

# Can't use np.convolve because window size changes
# Need np.add.reduceat

This is where reduceat shines, but it's complex. Better approach: use numba for this case or accept a loop.
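One fully vectorized alternative, sketched here for trailing windows that start at each position and are clipped at the array end (the window convention is an assumption), replaces reduceat with a prefix sum:

```python
import numpy as np

data = np.random.rand(10000)
window_sizes = np.random.randint(5, 100, size=10000)

# Window i covers data[i : i + window_sizes[i]], clipped at the array end
starts = np.arange(len(data))
ends = np.minimum(starts + window_sizes, len(data))

# Prefix sums: csum[k] == data[:k].sum(), so each slice sum is one subtraction
csum = np.concatenate(([0.0], np.cumsum(data)))
moving_avg = (csum[ends] - csum[starts]) / (ends - starts)
```

Adjust starts/ends if your windows are centered or leading rather than trailing.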


Memory-Efficient Operations

Problem: Operating on Arrays Larger Than RAM

You have a 50GB array. You have 16GB RAM. How do you process it?

Solution 1: Memory-Mapped Arrays

import numpy as np

# Create a memory-mapped array (stored on disk)
mmap_array = np.memmap(
    'large_data.dat',
    dtype='float64',
    mode='w+',
    shape=(100000000, 100)  # 80GB of data
)

# Write data in chunks
chunk_size = 1000000
for i in range(0, len(mmap_array), chunk_size):
    end = min(i + chunk_size, len(mmap_array))
    mmap_array[i:end] = np.random.rand(end - i, 100)
    mmap_array.flush()  # Write to disk

# Process in chunks
chunk_sums = []
for i in range(0, len(mmap_array), chunk_size):
    end = min(i + chunk_size, len(mmap_array))
    chunk = mmap_array[i:end]
    chunk_sums.append(chunk.sum())

total_sum = np.sum(chunk_sums)

The array lives on disk. NumPy loads chunks into RAM as needed.

Solution 2: Out-of-Core Processing with Views

import numpy as np

# Create large file
data = np.memmap('data.dat', dtype='float32', mode='w+', shape=(1000000000,))

# Process with views (no extra memory)
for i in range(0, len(data), 10000000):
    chunk = data[i:i+10000000]
    chunk *= 2  # Process in place

# Flush changes
del data  # Closes the file

In-Place Operations: Saving Memory

import numpy as np

# Creates a new array (doubles memory usage)
data = np.random.rand(100000000)
result = data * 2

# In-place (no extra memory)
data = np.random.rand(100000000)
data *= 2

All in-place operators:

  • +=, -=, *=, /=, //=, **=, %=
  • np.add(a, b, out=a), np.multiply(a, b, out=a), etc.

The trick: Use the out parameter in ufuncs.

import numpy as np

a = np.random.rand(10000000)
b = np.random.rand(10000000)

# Creates new array
result = np.add(a, b)

# Reuses existing array
result = np.empty(len(a))
np.add(a, b, out=result)

# Or even better, reuse one of the inputs
np.add(a, b, out=a)  # Result stored in a

Einstein Summation (einsum): The Swiss Army Knife

np.einsum is incredibly powerful but confusing. Once you understand it, it replaces dozens of reshapes and transposes.

The Notation

einsum uses index notation. Letters represent dimensions.

import numpy as np

# Matrix multiplication
A = np.random.rand(3, 4)
B = np.random.rand(4, 5)

# Traditional way
result = A @ B

# With einsum
result = np.einsum('ij,jk->ik', A, B)

'ij,jk->ik' means:

  • First array has dimensions i,j
  • Second array has dimensions j,k
  • Sum over j (appears on both sides)
  • Result has dimensions i,k

Common Patterns

Trace (diagonal sum):

import numpy as np

matrix = np.random.rand(100, 100)

# Traditional
trace = np.trace(matrix)

# Einsum
trace = np.einsum('ii->', matrix)

Batch matrix multiplication:

import numpy as np

# 10 matrices of size 3x4
A = np.random.rand(10, 3, 4)

# 10 matrices of size 4x5
B = np.random.rand(10, 4, 5)

# Multiply corresponding pairs
result = np.einsum('nij,njk->nik', A, B)

# Result shape: (10, 3, 5)

Element-wise multiplication then sum:

import numpy as np

A = np.random.rand(1000, 1000)
B = np.random.rand(1000, 1000)

# Traditional
result = (A * B).sum()

# Einsum (faster)
result = np.einsum('ij,ij->', A, B)

Outer product of all vectors:

import numpy as np

vectors = np.random.rand(100, 3)

# Outer product of each vector with itself
outer_products = np.einsum('ni,nj->nij', vectors, vectors)

# Shape: (100, 3, 3)

When einsum is Faster

  • Operations that combine reshape + transpose + multiply
  • Batch operations
  • Complex tensor contractions

The trick: Use optimize=True for complex expressions.

import numpy as np

A = np.random.rand(50, 100)
B = np.random.rand(100, 75)
C = np.random.rand(75, 50)

# Without optimization
result = np.einsum('ij,jk,kl->il', A, B, C)

# With optimization (finds best evaluation order)
result = np.einsum('ij,jk,kl->il', A, B, C, optimize=True)

The optimized version can be 10x faster for complex expressions.
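You can inspect the contraction order the optimizer picks with np.einsum_path, and reuse the path when evaluating the same expression many times:

```python
import numpy as np

A = np.random.rand(50, 100)
B = np.random.rand(100, 75)
C = np.random.rand(75, 50)

# einsum_path returns the chosen contraction order plus a cost report
path, report = np.einsum_path('ij,jk,kl->il', A, B, C, optimize='optimal')
print(report)

# Pass the precomputed path to skip re-optimizing on every call
result = np.einsum('ij,jk,kl->il', A, B, C, optimize=path)
```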


Structured Arrays: DataFrames Without Pandas

Structured arrays let you store different data types in one array.

import numpy as np

# Define structure
dtype = [
    ('name', 'U20'),      # 20-character Unicode string
    ('age', 'i4'),        # 32-bit integer
    ('salary', 'f8')      # 64-bit float
]

# Create structured array
employees = np.array([
    ('Alice', 30, 75000.0),
    ('Bob', 25, 65000.0),
    ('Charlie', 35, 85000.0)
], dtype=dtype)

print(employees)
# [('Alice', 30, 75000.) ('Bob', 25, 65000.) ('Charlie', 35, 85000.)]

# Access by field name
print(employees['name'])
# ['Alice' 'Bob' 'Charlie']

# Filter by condition
high_earners = employees[employees['salary'] > 70000]
print(high_earners)
# [('Alice', 30, 75000.) ('Charlie', 35, 85000.)]

# Calculate on specific field
avg_salary = employees['salary'].mean()
print(f"Average salary: ${avg_salary:,.2f}")

Why use this instead of Pandas?

  • Lower memory overhead
  • Faster for simple operations
  • Better for interfacing with C libraries
  • No external dependency
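The memory point is easy to check: a structured dtype packs its fields into fixed-size records with no per-object overhead.

```python
import numpy as np

dtype = np.dtype([('name', 'U20'), ('age', 'i4'), ('salary', 'f8')])

# One packed record: 20 chars x 4 bytes (UCS-4) + 4 + 8 = 92 bytes
print(dtype.itemsize)  # 92

employees = np.zeros(1000, dtype=dtype)
print(employees.nbytes)  # 92000
```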

The trick: Use np.sort with order parameter for multi-field sorting.

import numpy as np

dtype = [('name', 'U20'), ('age', 'i4'), ('salary', 'f8')]
employees = np.array([
    ('Alice', 30, 75000.0),
    ('Bob', 25, 85000.0),
    ('Charlie', 35, 65000.0)
], dtype=dtype)

# Sort ascending by salary, then age, then reverse for descending order
# (note: [::-1] reverses BOTH keys, so ties on salary come out age-descending)
sorted_employees = np.sort(employees, order=['salary', 'age'])[::-1]
print(sorted_employees)

Masked Arrays: Handling Missing Data Properly

Masked arrays let you mark certain values as invalid without removing them.

import numpy as np
import numpy.ma as ma

# Data with invalid values
data = np.array([1.5, -999, 3.2, -999, 5.1, 2.8])

# Mark -999 as invalid
masked_data = ma.masked_equal(data, -999)

print(masked_data)
# [1.5 -- 3.2 -- 5.1 2.8]

# Operations skip masked values
print(masked_data.mean())
# 3.15 (only valid values)

print(masked_data.std())
# 1.29 (only valid values)

More masking options:

import numpy as np
import numpy.ma as ma

data = np.array([1, 2, 3, 4, 5, 6, 7, 8, 9, 10])

# Mask values outside range
masked = ma.masked_outside(data, 3, 7)
print(masked)
# [-- -- 3 4 5 6 7 -- -- --]

# Mask based on condition
masked = ma.masked_where(data > 5, data)
print(masked)
# [1 2 3 4 5 -- -- -- -- --]

# Mask invalid values (inf, nan)
data_with_nan = np.array([1.0, np.nan, 3.0, np.inf, 5.0])
masked = ma.masked_invalid(data_with_nan)
print(masked)
# [1.0 -- 3.0 -- 5.0]

The trick: You can fill masked values for operations.

import numpy as np
import numpy.ma as ma

data = np.array([1.5, -999, 3.2, -999, 5.1, 2.8])
masked_data = ma.masked_equal(data, -999)

# Fill masked values with mean of valid values
filled = masked_data.filled(masked_data.mean())
print(filled)
# [1.5  3.15 3.2  3.15 5.1  2.8]

Advanced Broadcasting Tricks

Broadcasting Multiple Arrays

import numpy as np

# 1D arrays of different sizes
a = np.array([1, 2, 3])              # Shape: (3,)
b = np.array([10, 20])               # Shape: (2,)
c = np.array([100, 200, 300, 400])   # Shape: (4,)

# Add dimensions to make them compatible
a_3d = a[:, np.newaxis, np.newaxis]  # Shape: (3, 1, 1)
b_3d = b[np.newaxis, :, np.newaxis]  # Shape: (1, 2, 1)
c_3d = c[np.newaxis, np.newaxis, :]  # Shape: (1, 1, 4)

# Broadcast together
result = a_3d + b_3d + c_3d

print(result.shape)  # (3, 2, 4)

This creates all combinations of a, b, and c.

Using ellipsis (...) for Broadcasting

import numpy as np

# 4D array
data = np.random.rand(10, 20, 30, 40)

# Normalize along last axis
mean = data.mean(axis=-1, keepdims=True)
std = data.std(axis=-1, keepdims=True)
normalized = (data - mean) / std

# Using ellipsis (cleaner for high dimensions)
mean = data.mean(axis=-1)[..., np.newaxis]
std = data.std(axis=-1)[..., np.newaxis]
normalized = (data - mean) / std

... means "all other dimensions". Cleaner than counting dimensions.


Performance Profiling

Finding Bottlenecks

import numpy as np
import time

def slow_function():
    data = np.random.rand(10000, 10000)
    result = np.zeros_like(data)
    for i in range(data.shape[0]):
        result[i] = data[i] ** 2
    return result

def fast_function():
    data = np.random.rand(10000, 10000)
    return data ** 2

# Profile slow version
start = time.time()
slow_result = slow_function()
slow_time = time.time() - start

# Profile fast version
start = time.time()
fast_result = fast_function()
fast_time = time.time() - start

print(f"Slow: {slow_time:.3f}s")
print(f"Fast: {fast_time:.3f}s")
print(f"Speedup: {slow_time/fast_time:.1f}x")

Using line_profiler

pip install line_profiler
import numpy as np

@profile
def process_data():
    data = np.random.rand(10000, 10000)
    mean = data.mean(axis=0)
    std = data.std(axis=0)
    normalized = (data - mean) / std
    return normalized.sum()

process_data()

Run with:

kernprof -l -v script.py

Shows line-by-line timing.


Interfacing with C (When NumPy Isn't Fast Enough)

Using Numba for JIT Compilation

import numpy as np
from numba import jit

# Pure Python (slow)
def python_loop(n):
    result = 0
    for i in range(n):
        result += i ** 2
    return result

# Numba JIT compiled (fast)
@jit(nopython=True)
def numba_loop(n):
    result = 0
    for i in range(n):
        result += i ** 2
    return result

# First call compiles, second call is fast
result = numba_loop(10000000)  # Compilation happens here
result = numba_loop(10000000)  # 100x faster than Python

When to use Numba:

  • Loops you can't vectorize
  • Complex logic that's hard to express in NumPy
  • Need C-like speed without writing C

Using Cython

# file: fast_ops.pyx
import numpy as np
cimport numpy as np

def sum_array(np.ndarray[np.float64_t, ndim=1] arr):
    cdef int i
    cdef double total = 0.0
    cdef int n = arr.shape[0]

    for i in range(n):
        total += arr[i]

    return total

Compile with:

cython fast_ops.pyx
gcc -shared -pthread -fPIC -fwrapv -O2 -Wall -fno-strict-aliasing \
    -I/usr/include/python3.9 \
    -I"$(python3 -c 'import numpy; print(numpy.get_include())')" \
    -o fast_ops.so fast_ops.c

10-100x faster than pure Python loops.


Real-World Optimization Example

Let's optimize a real data processing pipeline.

Task: Calculate rolling correlation between two time series.

Naive approach:

import numpy as np

def rolling_correlation_slow(x, y, window):
    n = len(x)
    result = np.zeros(n - window + 1)

    for i in range(len(result)):
        window_x = x[i:i+window]
        window_y = y[i:i+window]
        result[i] = np.corrcoef(window_x, window_y)[0, 1]

    return result

# Test with 100,000 points, window of 100
x = np.random.rand(100000)
y = np.random.rand(100000)

# Takes 5+ seconds
corr = rolling_correlation_slow(x, y, 100)

Optimized approach:

import numpy as np
from numpy.lib.stride_tricks import as_strided

def rolling_correlation_fast(x, y, window):
    # Create sliding windows with strides (no copy)
    n = len(x) - window + 1

    x_windows = as_strided(
        x, shape=(n, window),
        strides=(x.itemsize, x.itemsize)
    )

    y_windows = as_strided(
        y, shape=(n, window),
        strides=(y.itemsize, y.itemsize)
    )

    # Vectorized correlation calculation
    x_mean = x_windows.mean(axis=1, keepdims=True)
    y_mean = y_windows.mean(axis=1, keepdims=True)

    x_centered = x_windows - x_mean
    y_centered = y_windows - y_mean

    numerator = (x_centered * y_centered).sum(axis=1)
    denominator = np.sqrt(
        (x_centered ** 2).sum(axis=1) * 
        (y_centered ** 2).sum(axis=1)
    )

    return numerator / denominator

# Takes 0.1 seconds (50x faster)
corr = rolling_correlation_fast(x, y, 100)

Key optimizations:

  1. Stride tricks instead of slicing (no copies)
  2. Vectorized operations instead of loops
  3. Broadcasting for mean subtraction
  4. Single pass through data

NumPy Gotchas That Break Production Code

Gotcha 1: Integer Division Behavior

import numpy as np

a = np.array([5, 7, 9])
b = np.array([2, 3, 4])

result = a / b
print(result)
# [2.5 2.33 2.25]  # Float division (Python 3 behavior)

# But with integer arrays
a_int = np.array([5, 7, 9], dtype=int)
b_int = np.array([2, 3, 4], dtype=int)

result_int = a_int / b_int
print(result_int)
# [2.5 2.33 2.25]  # Still float! The / operator always does true division

# Use // for integer division
result_floor = a_int // b_int
print(result_floor)
# [2 2 2]

The fix: Always be explicit about division types.

Gotcha 2: Array Equality

import numpy as np

a = np.array([1, 2, 3])
b = np.array([1, 2, 3])

# The comparison returns a boolean ARRAY, not a single bool
if a == b:  # ValueError: truth value of an array is ambiguous
    print("Equal")

# Correct ways
if np.array_equal(a, b):
    print("Equal")

if (a == b).all():
    print("Equal")

Gotcha 3: Modifying Arrays During Iteration

import numpy as np

arr = np.array([1, 2, 3, 4, 5])

# Slow: element-by-element Python loop (works, but easy to get wrong
# with views, and orders of magnitude slower)
for i in range(len(arr)):
    if arr[i] > 2:
        arr[i] = 0

# Fast and safe: vectorized boolean mask
arr[arr > 2] = 0

Gotcha 4: Floating Point Precision

import numpy as np

# Looks like it should equal 0.3
result = 0.1 + 0.1 + 0.1
print(result)
# 0.30000000000000004

# Use np.isclose for comparisons
print(np.isclose(result, 0.3))
# True

# Or allclose for arrays
a = np.array([0.1, 0.2, 0.3])
b = a + 1e-10
print(np.allclose(a, b))
# True

Summary

Advanced NumPy is about:

  • Understanding memory layout (C vs F order matters)
  • Manipulating strides for zero-copy views
  • Using einsum for complex tensor operations
  • Avoiding temporary arrays with in-place operations
  • Profiling to find real bottlenecks

Performance tricks that matter:

  • Memory-mapped arrays for out-of-core processing
  • Stride tricks for sliding windows
  • Integer array indexing for complex filtering
  • np.select for vectorizing conditional logic
  • The out parameter to reuse arrays

When to use what:

  • Structured arrays for simple tabular data
  • Masked arrays for handling missing values
  • einsum for complex linear algebra
  • Numba when loops can't be avoided
  • Memory mapping when data exceeds RAM

The golden rules:

  • Profile before optimizing
  • Vectorize when possible
  • Use views instead of copies
  • Be explicit about dtypes
  • Test with realistic data sizes

You now have the tools to write NumPy code that runs 10-100x faster than naive approaches. Use them wisely.

