Python can feel slow when you're doing heavy math or working with lots of data. I've been there, staring at a progress bar, wondering if there's a better way. The good news is there is. By using a few key techniques, you can make your numerical Python code run dramatically faster, sometimes as fast as languages like C, without leaving the comfort of Python.
This is about working smarter, not harder. It’s about letting specialized libraries do the heavy lifting for you. Let's look at some of the most effective methods.
The first and most important concept is vectorization. Think about how you multiply two lists of numbers in plain Python. You'd probably write a loop. Now imagine that loop running a million times. It's slow because Python is checking and interpreting every single step. Vectorization is the process of replacing these explicit loops with operations on entire arrays. A library called NumPy does this by pushing the actual computation into pre-compiled, optimized C code under the hood.
Here’s a direct comparison. We'll multiply two huge arrays, first with a Python loop, then with a vectorized operation.
import numpy as np
import time
# Create two arrays with 10 million random numbers each
size = 10_000_000
a = np.random.randn(size)
b = np.random.randn(size)
# The slow way: a Python loop
def scalar_multiply(a, b):
    result = np.empty_like(a)
    for i in range(len(a)):
        result[i] = a[i] * b[i]
    return result
# The fast way: a vectorized operation
def vectorized_multiply(a, b):
    return a * b  # NumPy handles the loop internally
# Let's time them
start = time.time()
scalar_result = scalar_multiply(a, b)
scalar_time = time.time() - start
start = time.time()
vectorized_result = vectorized_multiply(a, b)
vectorized_time = time.time() - start
print(f"Loop time: {scalar_time:.3f} seconds")
print(f"Vectorized time: {vectorized_time:.3f} seconds")
print(f"Speedup: {scalar_time/vectorized_time:.1f} times faster")
When I run this, the vectorized version is typically over 100 times faster. The a * b line looks simple, but it's a command that NumPy executes all at once in efficient, low-level memory. This works for all kinds of mathematical operations: addition, subtraction, trigonometric functions, and more. The rule is simple: if you find yourself writing a loop over array elements to do math, check if NumPy can do it in one go.
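As a quick illustration of that rule, here is a small sketch (the array and the condition are invented for the example) that replaces a loop containing an if statement with np.where and np.sin:
import numpy as np
values = np.random.randn(1_000_000)
# Loop version: apply sin to positive entries, zero out the rest
out = np.empty_like(values)
for i in range(len(values)):
    out[i] = np.sin(values[i]) if values[i] > 0 else 0.0
# Vectorized version: one line, no explicit Python loop
out_vec = np.where(values > 0, np.sin(values), 0.0)
print(np.allclose(out, out_vec))  # True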
Sometimes, your problem doesn't neatly fit into a vectorized operation. You have a complex function with lots of if statements and loops. This is where just-in-time (JIT) compilation saves the day. A library called Numba can take a regular Python function and, the first time you run it, compile it directly to machine code. After that, it runs at speeds comparable to C.
Let's estimate the value of π using a Monte Carlo method, which involves a lot of random number generation and checks.
import numba
import numpy as np
import time
# A pure Python version
def pure_python_pi(num_samples):
    inside = 0
    for _ in range(num_samples):
        x = np.random.random()
        y = np.random.random()
        if x**2 + y**2 <= 1.0:
            inside += 1
    return 4.0 * inside / num_samples
# The same function, but we add a magic decorator from Numba
@numba.jit(nopython=True)
def compiled_pi(num_samples):
    inside = 0
    for _ in range(num_samples):
        x = np.random.random()
        y = np.random.random()
        if x**2 + y**2 <= 1.0:
            inside += 1
    return 4.0 * inside / num_samples
# Warm up the JIT compiler by running it once
compiled_pi(1000)
num_samples = 1_000_000
start = time.time()
pi_pure = pure_python_pi(num_samples)
pure_time = time.time() - start
start = time.time()
pi_fast = compiled_pi(num_samples)
fast_time = time.time() - start
print(f"Pure Python: {pi_pure:.6f} in {pure_time:.3f}s")
print(f"Numba JIT: {pi_fast:.6f} in {fast_time:.3f}s")
print(f"Speedup: {pure_time/fast_time:.1f}x")
The @jit(nopython=True) decorator is the key. It tells Numba to compile the function to machine code that never falls back to the Python interpreter; if it can't, it raises an error instead of silently running slowly. The first call pays a small compilation overhead, but every subsequent call is lightning fast. I use this for numerical kernels, the small, compute-intensive parts of my code that get called millions of times.
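If the loop iterations in your kernel are independent, Numba can also spread them across CPU threads with parallel=True and numba.prange. A minimal sketch building on the same Monte Carlo idea (the actual speedup depends on your core count):
import numba
import numpy as np
@numba.jit(nopython=True, parallel=True)
def compiled_pi_parallel(num_samples):
    inside = 0
    # prange marks this loop as safe to split across threads;
    # Numba recognizes the += as a reduction
    for _ in numba.prange(num_samples):
        x = np.random.random()
        y = np.random.random()
        if x**2 + y**2 <= 1.0:
            inside += 1
    return 4.0 * inside / num_samples
print(compiled_pi_parallel(10_000_000))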
What happens when your dataset is too big to fit in your computer's RAM? You can't load it all at once. The solution is to pretend you can. Memory mapping creates a bridge between a file on your hard drive and an array in your Python program. You work with the array as if it's entirely in memory, but the operating system quietly loads and unloads chunks of data from the disk as needed.
It's like having a giant bookshelf (your hard drive) and only keeping the book you're reading on your desk (your RAM). Here's how you set it up.
import numpy as np
import os
import tempfile
# Create a temporary file to act as our "disk"
temp_file = tempfile.NamedTemporaryFile(suffix='.dat', delete=False)
temp_file.close()
# Let's say we want a huge array: 10,000 by 10,000 of 64-bit floats.
# That's about 800 MB. We create a memory map to it.
shape = (10000, 10000)
large_array = np.memmap(temp_file.name, dtype='float64', mode='w+', shape=shape)
# Now we can fill it. To avoid memory issues, we fill it in chunks.
chunk_size = 1000
for i in range(0, shape[0], chunk_size):
    end_i = min(i + chunk_size, shape[0])
    # Create some sample data for this chunk
    chunk = np.random.randn(end_i - i, shape[1])
    large_array[i:end_i, :] = chunk
    print(f"Written rows {i} to {end_i}...")
large_array.flush() # Make sure it's written to disk
# The array lives on disk. We can "close" and reopen it.
del large_array
# Re-open in read-only mode to analyze
large_array_ro = np.memmap(temp_file.name, dtype='float64', mode='r', shape=shape)
# Compute the mean without holding everything in RAM; the OS pages data in as NumPy streams through the file.
mean_val = large_array_ro.mean()
print(f"The mean of the entire massive array is: {mean_val:.4f}")
# Clean up the temporary file
os.unlink(temp_file.name)
This technique is a lifesaver for processing large datasets, like scientific measurements or financial history, that are dozens of gigabytes in size.
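If a single call like .mean() still pulls more data through memory than you'd like, or you need custom per-chunk logic, you can walk the memory-mapped file in row blocks yourself. A rough sketch, assuming the read-only memmap (large_array_ro) and shape from the example above are still open (i.e., run it before the cleanup step):
# Walk the file in row blocks so only one block is in RAM at a time
block = 1000
running_sum = 0.0
count = 0
for i in range(0, shape[0], block):
    rows = large_array_ro[i:i + block, :]  # only this slice is read from disk
    running_sum += float(rows.sum())
    count += rows.size
print(f"Chunked mean: {running_sum / count:.4f}")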
Modern computers have multiple processor cores, but a typical Python script only uses one. Parallel processing is about dividing your work and farming it out to all the cores simultaneously. The concurrent.futures module makes this relatively straightforward for problems that can be split into independent pieces.
Imagine you need to apply an expensive function to every element in a long list. Doing it one by one is slow. Doing it 8 at a time on an 8-core machine can approach an 8x speedup, minus some overhead for starting worker processes and shuttling data between them.
import concurrent.futures
import math
import time
def complex_calculation(value):
    """A function that takes a while to compute."""
    time.sleep(0.01)  # Simulate some hard work
    return math.sqrt(math.exp(math.sin(value)))
def process_serial(data):
    """Process the list one item at a time."""
    results = []
    for item in data:
        results.append(complex_calculation(item))
    return results
def process_parallel(data, num_workers=4):
    """Process the list using multiple cores."""
    results = []
    # Create a pool of workers
    with concurrent.futures.ProcessPoolExecutor(max_workers=num_workers) as executor:
        # Submit all tasks to the pool
        future_to_item = {executor.submit(complex_calculation, item): item for item in data}
        # Collect results as they complete
        for future in concurrent.futures.as_completed(future_to_item):
            results.append(future.result())
    return results
# Create a list of data to process
data_list = list(range(100))
print("Starting serial processing...")
start = time.time()
serial_results = process_serial(data_list)
serial_time = time.time() - start
print(f"Starting parallel processing with 4 workers...")
start = time.time()
parallel_results = process_parallel(data_list, num_workers=4)
parallel_time = time.time() - start
print(f"Serial time: {serial_time:.2f} seconds")
print(f"Parallel time: {parallel_time:.2f} seconds")
print(f"Speedup: {serial_time/parallel_time:.2f}x")
print(f"Results match: {all(s == p for s, p in zip(sorted(serial_results), sorted(parallel_results)))}")
The key is that complex_calculation(item) must be completely independent for each item. This is called an "embarrassingly parallel" problem and is perfect for this technique. If the tasks need to communicate, it gets more complex, but for many numerical tasks, this simple pattern works wonders.
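For this kind of "apply one function to every item" job, executor.map is an even shorter pattern, and it returns results in input order. A sketch under the same assumptions (independent, picklable tasks); note that with ProcessPoolExecutor the top-level code should sit under an if __name__ == "__main__": guard, which Windows and macOS require:
import concurrent.futures
import math
def complex_calculation(value):
    return math.sqrt(math.exp(math.sin(value)))
if __name__ == "__main__":
    data_list = list(range(100))
    with concurrent.futures.ProcessPoolExecutor(max_workers=4) as executor:
        # chunksize batches items per worker to cut down inter-process overhead
        results = list(executor.map(complex_calculation, data_list, chunksize=10))
    print(results[:5])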
In many scientific problems, especially those involving networks or differential equations, you deal with massive matrices that are mostly zeros. Storing and multiplying a 100,000 x 100,000 matrix of zeros is a waste of memory and time. Sparse matrix formats store only the non-zero values and their locations.
The SciPy library provides these formats. The most common is Compressed Sparse Row (CSR). Let's see the memory savings.
import scipy.sparse as sp
import numpy as np
import time
# Create a dense matrix, mostly zeros
n = 10000 # 10,000 rows and columns
dense_matrix = np.zeros((n, n))
# Put a few random non-zero values on a diagonal band
for i in range(n):
    for j in range(max(0, i-1), min(n, i+2)):  # 3 diagonals
        dense_matrix[i, j] = np.random.randn()
print(f"Dense matrix size: {dense_matrix.shape}")
print(f"Dense matrix memory: {dense_matrix.nbytes / 1e6:.2f} MB")
# Convert to sparse CSR format
sparse_matrix = sp.csr_matrix(dense_matrix)
print(f"\nSparse matrix (CSR) format")
print(f"Number of non-zero entries: {sparse_matrix.nnz}")
# CSR stores three arrays: the non-zero values, their column indices, and the row pointers
sparse_memory_est = (sparse_matrix.data.nbytes + sparse_matrix.indices.nbytes + sparse_matrix.indptr.nbytes) / 1e6
print(f"Sparse memory: {sparse_memory_est:.2f} MB")
print(f"Compression ratio: {dense_matrix.nbytes / (sparse_memory_est * 1e6):.1f}x")
# The real power is in operations. Let's multiply.
vector = np.random.randn(n)
start = time.time()
dense_result = dense_matrix.dot(vector)
dense_time = time.time() - start
start = time.time()
sparse_result = sparse_matrix.dot(vector)
sparse_time = time.time() - start
print(f"\nMatrix-vector multiplication:")
print(f"Dense time: {dense_time:.4f}s")
print(f"Sparse time: {sparse_time:.4f}s")
print(f"Speedup: {dense_time/sparse_time:.1f}x")
print(f"Results match: {np.allclose(dense_result, sparse_result)}")
For very large, sparse problems, like simulating a large circuit or solving a partial differential equation over a fine grid, using sparse matrices isn't just an optimization; it's the only feasible way to solve them.
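One caveat about the demo above: it builds the dense matrix first, which only works while that still fits in RAM. For genuinely large problems you construct the sparse matrix directly. Here's a hedged sketch using scipy.sparse.diags to build a tridiagonal system and scipy.sparse.linalg.spsolve to solve it, without a dense array ever existing:
import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla
n = 100_000  # far larger than we could afford as a dense n x n array
main_diag = 2.0 * np.ones(n)
off_diag = -1.0 * np.ones(n - 1)
# Assemble the three diagonals straight into sparse storage
A = sp.diags([off_diag, main_diag, off_diag], offsets=[-1, 0, 1], format='csc')
b = np.random.randn(n)
x = spla.spsolve(A, b)  # sparse direct solve
print(f"Residual norm: {np.linalg.norm(A @ x - b):.2e}")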
Often in engineering and science, you need to find the integral (area under a curve) or derivative (slope) of a function that's too complicated for pen-and-paper calculus. Numerical methods give us approximate answers.
For integration, SciPy's quad function is a workhorse. It's adaptive, meaning it uses more points where the function is wiggly and fewer where it's flat.
from scipy import integrate
import numpy as np
# A function that's hard to integrate by hand
def tricky_function(x):
    return np.exp(-x**2) * np.log(1 + x**2)
# Integrate from 0 to 5
result, estimated_error = integrate.quad(tricky_function, 0, 5)
print(f"Integral of f(x) from 0 to 5:")
print(f"Result: {result:.10f}")
print(f"Estimated absolute error: {estimated_error:.2e}")
# For derivatives, we can use finite differences.
# The central difference method is a good balance of accuracy and stability.
def numerical_derivative(f, x, h=1e-5):
    """Calculate derivative using central difference."""
    return (f(x + h) - f(x - h)) / (2 * h)
# Let's test it on a known function: sin(x). Its derivative is cos(x).
x_value = 1.0
true_derivative = np.cos(x_value)
our_derivative = numerical_derivative(np.sin, x_value)
print(f"\nDerivative of sin(x) at x = {x_value}:")
print(f"True value (cos(x)): {true_derivative:.10f}")
print(f"Our approximation: {our_derivative:.10f}")
print(f"Difference: {abs(true_derivative - our_derivative):.2e}")
These tools let you move forward with modeling physical systems even when the math gets messy. The errors are usually well-controlled and small enough for practical purposes.
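Both quad and the central-difference helper assume you can evaluate the function anywhere you like. When all you have is sampled data, say measurements at fixed points, the sampled-data counterparts are scipy.integrate.trapezoid for the area and np.gradient for the slope. A small sketch on synthetic samples:
import numpy as np
from scipy.integrate import trapezoid
# Pretend these are measurements: sin(x) sampled at 1,000 points
x = np.linspace(0, np.pi, 1000)
y = np.sin(x)
area = trapezoid(y, x)     # integral of sin(x) from 0 to pi is exactly 2
slope = np.gradient(y, x)  # derivative estimate at every sample point
print(f"Trapezoidal integral: {area:.6f} (exact: 2.0)")
print(f"Derivative at x=0: {slope[0]:.4f} (exact cos(0) = 1.0)")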
The final common task is optimization: finding the minimum or maximum of a function. This could be minimizing cost, maximizing efficiency, or fitting a model to data. SciPy offers a toolbox of algorithms.
A classic test problem is the Rosenbrock function, a curved valley that's hard for simple algorithms to navigate.
from scipy import optimize
import numpy as np
def rosenbrock(x):
    """The Rosenbrock 'banana' function. Minimum at (1, 1)."""
    return 100 * (x[1] - x[0]**2)**2 + (1 - x[0])**2
# Let's start from a difficult point
initial_guess = [-1.2, 1.0]
# Use the Nelder-Mead simplex algorithm (robust, doesn't need derivatives)
result = optimize.minimize(rosenbrock, initial_guess, method='Nelder-Mead')
print("Minimizing the Rosenbrock function.")
print(f"Started at: [{initial_guess[0]}, {initial_guess[1]}]")
print(f"Algorithm found a minimum at: [{result.x[0]:.6f}, {result.x[1]:.6f}]")
print(f"The function value there is: {result.fun:.2e}")
print(f"Success: {result.success}")
print(f"Message: {result.message}")
# If we know the gradient, we can use faster methods.
def rosenbrock_grad(x):
    """Gradient of the Rosenbrock function."""
    dx = -400 * x[0] * (x[1] - x[0]**2) - 2 * (1 - x[0])
    dy = 200 * (x[1] - x[0]**2)
    return np.array([dx, dy])
result_with_grad = optimize.minimize(rosenbrock, initial_guess, method='BFGS', jac=rosenbrock_grad)
print(f"\nWith gradient info (BFGS method):")
print(f"Found minimum at: [{result_with_grad.x[0]:.6f}, {result_with_grad.x[1]:.6f}]")
print(f"Function evaluations needed: {result.nfev} (Nelder) vs {result_with_grad.nfev} (BFGS)")
Choosing the right optimizer depends on your problem. If you have smooth functions and gradients, gradient-based methods like BFGS are very fast. For noisy or irregular functions, simplex or Powell methods are more reliable.
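Fitting a model to data, mentioned at the start of this section, is just optimization with a least-squares objective, and SciPy wraps the whole pattern as scipy.optimize.curve_fit. A short sketch on synthetic noisy data (the exponential model and its "true" parameters are invented for illustration):
import numpy as np
from scipy.optimize import curve_fit
def model(x, amplitude, decay):
    """A simple exponential-decay model."""
    return amplitude * np.exp(-decay * x)
# Generate synthetic "measurements" from known parameters plus noise
rng = np.random.default_rng(0)
x_data = np.linspace(0, 4, 50)
y_data = model(x_data, 2.5, 1.3) + 0.05 * rng.standard_normal(x_data.size)
# popt holds the best-fit parameters, pcov their covariance estimate
popt, pcov = curve_fit(model, x_data, y_data, p0=[1.0, 1.0])
print(f"Fitted amplitude: {popt[0]:.3f}, fitted decay: {popt[1]:.3f}")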
Putting it all together is where the real power lies. A typical workflow in my projects might look like this: I use memory-mapped arrays (np.memmap) to load a large dataset from disk. I then use vectorized NumPy operations to clean and filter it. The core numerical model, which might involve a custom loop, gets supercharged with a Numba @jit decorator. If the model needs to be run with thousands of different parameters, I spread those runs across CPU cores using ProcessPoolExecutor. The model itself might solve a system of equations built with SciPy's sparse matrices, and the final step could involve using scipy.optimize to fit the model's output to real-world data.
Each technique solves a specific performance bottleneck. Vectorization speeds up element-wise math. JIT compilation accelerates complex, custom loops. Memory mapping lets you work with data bigger than RAM. Parallel processing uses all your CPU cores. Sparse matrices save memory and time on data that's mostly empty. Numerical integration, differentiation, and optimization provide the mathematical tools you need. By understanding and applying these methods, you can transform Python from a prototyping tool into a powerful engine for high-performance computing.