
Nithin Bharadwaj

# 8 Python Performance Optimization Techniques That Transform Scientific Computing Speed

As a best-selling author, I invite you to explore my books on Amazon. Don't forget to follow me on Medium and show your support. Thank you! Your support means the world!

Python's flexibility makes it a go-to choice for scientific computing, but performance optimization becomes essential when dealing with large-scale numerical computations. I've spent countless hours optimizing scientific code, and these eight techniques consistently deliver the most significant performance improvements while maintaining code clarity.

## NumPy Vectorization: The Foundation of Fast Computing

When I first started scientific programming, I wrote loops everywhere. The transformation to vectorized operations changed everything. NumPy vectorization eliminates explicit loops by applying operations to entire arrays simultaneously, leveraging optimized C implementations and SIMD instructions.

```python
import numpy as np
import time

def demonstrate_vectorization():
    # Generate large dataset
    size = 1000000
    data = np.random.rand(size)

    # Slow approach with explicit loop
    start = time.time()
    result_slow = np.zeros(size)
    for i in range(size):
        result_slow[i] = np.sin(data[i]) * np.cos(data[i])
    loop_time = time.time() - start

    # Fast vectorized approach
    start = time.time()
    result_fast = np.sin(data) * np.cos(data)
    vector_time = time.time() - start

    print(f"Loop time: {loop_time:.4f}s")
    print(f"Vectorized time: {vector_time:.4f}s")
    print(f"Speedup: {loop_time / vector_time:.1f}x")

    # Verify results are identical
    print(f"Results match: {np.allclose(result_slow, result_fast)}")

# Advanced vectorization with broadcasting
def compute_distance_matrix(points):
    """Compute pairwise distances efficiently"""
    # points shape: (n_points, n_dimensions)
    # Broadcasting creates (n, 1, d) - (1, n, d) = (n, n, d)
    diff = points[:, np.newaxis, :] - points[np.newaxis, :, :]
    distances = np.sqrt(np.sum(diff**2, axis=2))
    return distances

# Example usage
points = np.random.rand(1000, 3)
distances = compute_distance_matrix(points)
print(f"Distance matrix shape: {distances.shape}")
```

The performance difference is staggering. Vectorized operations typically run 10-100 times faster than equivalent Python loops because they push the computation down to optimized C code that can utilize modern CPU features.
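
Conditional logic is a common reason explicit loops creep back in, but `np.where` keeps even piecewise expressions vectorized. Here is a minimal sketch of that pattern (the `piecewise_*` helpers are illustrative names, not part of the benchmark above):

```python
import numpy as np

# Illustrative sketch: vectorizing conditional (piecewise) logic with np.where
data = np.random.rand(1_000_000)

def piecewise_loop(x):
    # Element-by-element Python loop with a branch
    out = np.empty_like(x)
    for i in range(x.size):
        out[i] = x[i] ** 2 if x[i] > 0.5 else 0.0
    return out

def piecewise_vectorized(x):
    # The condition is evaluated for the whole array at once
    return np.where(x > 0.5, x ** 2, 0.0)

assert np.allclose(piecewise_loop(data), piecewise_vectorized(data))
```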

## Just-In-Time Compilation with Numba

Numba transforms Python functions into optimized machine code at runtime. The @jit decorator provides near-C performance without changing Python syntax, making it perfect for numerical algorithms that can't be easily vectorized.

```python
from numba import jit, prange, njit
import numpy as np
import time

@njit
def mandelbrot_iteration(c, max_iter=100):
    """Compute Mandelbrot set membership (JIT compiled)"""
    z = 0j  # complex zero keeps Numba's type inference stable
    for i in range(max_iter):
        if abs(z) > 2:
            return i
        z = z*z + c
    return max_iter

@njit(parallel=True)
def generate_mandelbrot(width, height, x_min, x_max, y_min, y_max, max_iter=100):
    """Generate Mandelbrot set using parallel JIT compilation"""
    result = np.zeros((height, width))

    for i in prange(height):      # outer loop is parallelized across threads
        for j in range(width):    # inner loop runs serially within each thread
            # Map pixel to complex plane
            real = x_min + (x_max - x_min) * j / width
            imag = y_min + (y_max - y_min) * i / height
            c = complex(real, imag)

            result[i, j] = mandelbrot_iteration(c, max_iter)

    return result

# A JIT-compiled histogram that avoids temporary allocations

@njit
def optimized_histogram(data, bins):
    """Fast histogram computation with JIT"""
    hist = np.zeros(bins)
    # Compute the data range once, outside the loop
    data_min = data.min()
    bin_width = (data.max() - data_min) / bins

    for value in data:
        bin_idx = int((value - data_min) / bin_width)
        if bin_idx >= bins:
            bin_idx = bins - 1
        hist[bin_idx] += 1

    return hist

# Performance comparison
def compare_mandelbrot_performance():
    width, height = 800, 600
    bounds = (-2.5, 1.0, -1.5, 1.5)

    # Time JIT compiled version
    start = time.time()
    mandelbrot_fast = generate_mandelbrot(width, height, *bounds)
    jit_time = time.time() - start

    print(f"JIT compiled Mandelbrot: {jit_time:.4f}s")
    return mandelbrot_fast
```

Numba excels at mathematical computations, nested loops, and algorithms that resist vectorization. The parallel=True parameter automatically distributes work across CPU cores for additional speedup.
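
The same `prange` machinery also covers reductions: Numba recognizes accumulation patterns like `total += ...` and combines the per-thread partial sums for you. A small illustrative sketch (the function name is mine, not from the benchmark above):

```python
import numpy as np
from numba import njit, prange

@njit(parallel=True)
def parallel_sum_of_squares(values):
    # Numba treats this accumulation as a reduction and merges
    # the per-thread partial sums safely.
    total = 0.0
    for i in prange(values.shape[0]):
        total += values[i] * values[i]
    return total

data = np.random.rand(10_000_000)
print(f"Sum of squares: {parallel_sum_of_squares(data):.2f}")
```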

## Memory Layout Optimization

Memory access patterns significantly impact performance. Modern CPUs have complex cache hierarchies, and optimizing data layout to work with these caches can provide substantial speedups.

```python
import numpy as np
import time

def demonstrate_memory_layout():
    # Create test matrices
    size = 1000
    matrix_c = np.random.rand(size, size)  # C-contiguous (row-major)
    matrix_f = np.asfortranarray(matrix_c)  # Fortran-contiguous (column-major)

    print(f"C-contiguous flags: {matrix_c.flags}")
    print(f"F-contiguous flags: {matrix_f.flags}")

    # Row-wise operations (favor C-contiguous)
    start = time.time()
    for i in range(size):
        row_sum = np.sum(matrix_c[i, :])
    c_row_time = time.time() - start

    start = time.time()
    for i in range(size):
        row_sum = np.sum(matrix_f[i, :])
    f_row_time = time.time() - start

    # Column-wise operations (favor F-contiguous)
    start = time.time()
    for j in range(size):
        col_sum = np.sum(matrix_c[:, j])
    c_col_time = time.time() - start

    start = time.time()
    for j in range(size):
        col_sum = np.sum(matrix_f[:, j])
    f_col_time = time.time() - start

    print(f"Row operations - C: {c_row_time:.4f}s, F: {f_row_time:.4f}s")
    print(f"Column operations - C: {c_col_time:.4f}s, F: {f_col_time:.4f}s")

# Cache-friendly matrix multiplication
def cache_friendly_matmul(A, B, block_size=64):
    """Block-wise matrix multiplication for better cache usage"""
    n, m, p = A.shape[0], A.shape[1], B.shape[1]
    C = np.zeros((n, p))

    for i0 in range(0, n, block_size):
        for j0 in range(0, p, block_size):
            for k0 in range(0, m, block_size):
                # Define block boundaries
                i1 = min(i0 + block_size, n)
                j1 = min(j0 + block_size, p)
                k1 = min(k0 + block_size, m)

                # Multiply blocks
                C[i0:i1, j0:j1] += A[i0:i1, k0:k1] @ B[k0:k1, j0:j1]

    return C

# Memory pre-allocation for better performance
class PerformanceBuffer:
    """Reusable buffer to avoid memory allocation overhead"""
    def __init__(self, max_size):
        self.buffer = np.empty(max_size)
        self.max_size = max_size

    def get_view(self, size):
        if size > self.max_size:
            raise ValueError(f"Requested size {size} exceeds buffer size {self.max_size}")
        return self.buffer[:size]

# Usage example
buffer = PerformanceBuffer(1000000)
working_array = buffer.get_view(50000)
# Use working_array for computations without allocation cost
```

Understanding memory layout helps optimize algorithms for modern hardware. C-contiguous arrays perform better for row-wise operations, while Fortran-contiguous arrays excel at column-wise access patterns.
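
When an algorithm has to scan repeatedly along the "wrong" axis, a one-time layout conversion with `np.asfortranarray` (or `np.ascontiguousarray` for the row-wise case) often pays for itself. A brief sketch of that idea:

```python
import numpy as np

# Sketch: repeated column scans over a row-major array are cache-unfriendly;
# one up-front copy into Fortran order makes every column slice contiguous.
matrix_c = np.random.rand(2000, 2000)      # C-contiguous (row-major)
matrix_f = np.asfortranarray(matrix_c)     # one-time copy to column-major

col_norms = np.empty(matrix_f.shape[1])
for j in range(matrix_f.shape[1]):
    col_norms[j] = np.linalg.norm(matrix_f[:, j])  # contiguous slice

print(f"Largest column norm: {col_norms.max():.4f}")
```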

## Sparse Matrix Operations

Many scientific problems involve matrices with mostly zero values. Sparse matrix representations dramatically reduce memory usage and computation time by storing only non-zero elements.

```python
import scipy.sparse as sp
import numpy as np
import time

def sparse_matrix_efficiency():
    """Demonstrate sparse matrix advantages"""
    # Create large sparse matrix (finite difference operator)
    n = 10000
    main_diag = -2 * np.ones(n)
    off_diag = np.ones(n-1)

    # Create sparse tridiagonal matrix
    sparse_matrix = sp.diags([off_diag, main_diag, off_diag], 
                            [-1, 0, 1], 
                            shape=(n, n), 
                            format='csr')

    # Memory comparison
    dense_matrix = sparse_matrix.toarray()
    sparse_memory = sparse_matrix.data.nbytes + sparse_matrix.indices.nbytes + sparse_matrix.indptr.nbytes
    dense_memory = dense_matrix.nbytes

    print(f"Sparse matrix memory: {sparse_memory / 1024**2:.2f} MB")
    print(f"Dense matrix memory: {dense_memory / 1024**2:.2f} MB")
    print(f"Memory reduction: {dense_memory / sparse_memory:.1f}x")

    # Performance comparison for matrix-vector multiplication
    vector = np.random.rand(n)

    start = time.time()
    result_sparse = sparse_matrix @ vector
    sparse_time = time.time() - start

    start = time.time()
    result_dense = dense_matrix @ vector
    dense_time = time.time() - start

    print(f"Sparse multiplication: {sparse_time:.6f}s")
    print(f"Dense multiplication: {dense_time:.6f}s")
    print(f"Speedup: {dense_time / sparse_time:.1f}x")

# Sparse linear system solving
def solve_sparse_system():
    """Efficient sparse linear system solution"""
    from scipy.sparse.linalg import spsolve
    from scipy.sparse import csc_matrix

    # Create sparse coefficient matrix (Poisson equation discretization)
    n = 1000
    A = sp.diags([-1, 4, -1], [-1, 0, 1], shape=(n, n), format='csc')
    b = np.random.rand(n)

    # Solve sparse system
    start = time.time()
    x_sparse = spsolve(A, b)
    sparse_solve_time = time.time() - start

    # Compare with dense solve
    A_dense = A.toarray()
    start = time.time()
    x_dense = np.linalg.solve(A_dense, b)
    dense_solve_time = time.time() - start

    print(f"Sparse solve time: {sparse_solve_time:.6f}s")
    print(f"Dense solve time: {dense_solve_time:.6f}s")
    print(f"Solution error: {np.linalg.norm(x_sparse - x_dense):.2e}")

# Custom sparse matrix operations
def create_sparse_graph_laplacian(adjacency_matrix):
    """Create graph Laplacian from adjacency matrix"""
    degree_matrix = sp.diags(np.array(adjacency_matrix.sum(axis=1)).flatten())
    laplacian = degree_matrix - adjacency_matrix
    return laplacian

# Generate example graph
n_nodes = 5000
density = 0.001
adjacency = sp.random(n_nodes, n_nodes, density=density, format='csr')
adjacency = (adjacency + adjacency.T) / 2  # Make symmetric
laplacian = create_sparse_graph_laplacian(adjacency)
```

Sparse matrices are essential for large-scale scientific computing. They enable solving systems with millions of unknowns that would be impossible with dense representations due to memory constraints.
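
Format choice matters as much as sparsity itself: COO is convenient for assembling entries, while CSR is the workhorse for arithmetic and matrix-vector products. A hedged sketch of that assembly pattern (the variable names are illustrative):

```python
import numpy as np
import scipy.sparse as sp

# Sketch: collect (row, col, value) triplets, build a COO matrix,
# then convert to CSR for fast arithmetic and row slicing.
n, nnz = 10000, 5000
rows = np.random.randint(0, n, size=nnz)
cols = np.random.randint(0, n, size=nnz)
vals = np.random.rand(nnz)

coo = sp.coo_matrix((vals, (rows, cols)), shape=(n, n))
csr = coo.tocsr()                 # duplicate entries are summed during conversion

vector = np.random.rand(n)
result = csr @ vector
print(f"Result shape: {result.shape}, stored nonzeros: {csr.nnz}")
```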

## Parallel Processing Strategies

Modern computers have multiple CPU cores, and parallel processing can dramatically accelerate computations. Python offers several approaches for parallel execution, each suited to different problem types.

```python
import multiprocessing as mp
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor
import numpy as np
import time
from numba import prange, jit

def cpu_intensive_task(data_chunk):
    """Simulate CPU-intensive scientific computation"""
    result = np.zeros_like(data_chunk)
    for i, value in enumerate(data_chunk):
        # Complex mathematical operation
        result[i] = np.sum([np.sin(value * k) * np.cos(value / k) 
                           for k in range(1, 100)])
    return result

def parallel_processing_comparison():
    """Compare different parallel processing approaches"""
    # Generate test data
    data_size = 10000
    data = np.random.rand(data_size)
    n_processes = mp.cpu_count()
    chunk_size = data_size // n_processes

    # Serial processing
    start = time.time()
    result_serial = cpu_intensive_task(data)
    serial_time = time.time() - start

    # Multiprocessing with Pool
    start = time.time()
    with ProcessPoolExecutor(max_workers=n_processes) as executor:
        chunks = [data[i:i+chunk_size] for i in range(0, data_size, chunk_size)]
        results = list(executor.map(cpu_intensive_task, chunks))
        result_parallel = np.concatenate(results)
    parallel_time = time.time() - start

    print(f"Serial time: {serial_time:.4f}s")
    print(f"Parallel time: {parallel_time:.4f}s")
    print(f"Speedup: {serial_time / parallel_time:.1f}x")
    print(f"Efficiency: {(serial_time / parallel_time) / n_processes:.1%}")

@jit(nopython=True, parallel=True)
def parallel_matrix_operations(matrix):
    """Numba parallel matrix processing"""
    rows, cols = matrix.shape
    result = np.zeros_like(matrix)

    for i in prange(rows):
        for j in range(cols):
            # Complex operation on each element
            result[i, j] = np.sqrt(matrix[i, j]**2 + 
                                 np.sin(matrix[i, j]) + 
                                 np.cos(matrix[i, j]))
    return result

# Parallel Monte Carlo simulation
@jit(nopython=True)
def monte_carlo_pi_chunk(n_samples):
    """Estimate π using Monte Carlo method"""
    count = 0
    for _ in range(n_samples):
        x = np.random.random()
        y = np.random.random()
        if x*x + y*y <= 1.0:
            count += 1
    return count

def parallel_monte_carlo_pi(total_samples, n_processes=None):
    """Parallel Monte Carlo π estimation"""
    if n_processes is None:
        n_processes = mp.cpu_count()

    samples_per_process = total_samples // n_processes

    with ProcessPoolExecutor(max_workers=n_processes) as executor:
        futures = [executor.submit(monte_carlo_pi_chunk, samples_per_process) 
                  for _ in range(n_processes)]

        total_hits = sum(future.result() for future in futures)

    pi_estimate = 4.0 * total_hits / total_samples
    return pi_estimate

# Usage example
if __name__ == "__main__":
    pi_estimate = parallel_monte_carlo_pi(10000000)
    print(f"π estimate: {pi_estimate:.6f}")
    print(f"Error: {abs(pi_estimate - np.pi):.6f}")
```

The key to effective parallelization is matching the approach to the problem structure. CPU-bound tasks benefit from multiprocessing, while I/O-bound operations often work better with threading.
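
For contrast with the process pools above, here is a small thread-pool sketch for the I/O-bound case; the `time.sleep` call simply stands in for disk or network latency:

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Sketch: threads suit I/O-bound work because the GIL is released while
# a thread waits on I/O (simulated here with time.sleep).
def fetch_record(record_id):
    time.sleep(0.05)              # stands in for a disk read or network request
    return record_id * 2

start = time.time()
with ThreadPoolExecutor(max_workers=16) as executor:
    results = list(executor.map(fetch_record, range(64)))
elapsed = time.time() - start

print(f"Fetched {len(results)} records in {elapsed:.2f}s "
      f"(serial would take ~{64 * 0.05:.1f}s)")
```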

## BLAS Integration and Linear Algebra Optimization

NumPy automatically uses optimized BLAS (Basic Linear Algebra Subprograms) libraries when available. These implementations provide hardware-specific optimizations for fundamental mathematical operations that form the backbone of scientific computing.

```python
import numpy as np
import time

def demonstrate_blas_performance():
    """Show BLAS optimization benefits"""
    # Check which BLAS library NumPy is using
    print("NumPy configuration:")
    np.show_config()

    # Large matrix operations that benefit from BLAS
    size = 2000
    A = np.random.rand(size, size)
    B = np.random.rand(size, size)

    # Matrix multiplication (uses GEMM from BLAS)
    start = time.time()
    C = A @ B
    matmul_time = time.time() - start

    # Eigenvalue decomposition (uses LAPACK)
    start = time.time()
    eigenvals, eigenvecs = np.linalg.eigh(A @ A.T)
    eigen_time = time.time() - start

    print(f"Matrix multiplication ({size}x{size}): {matmul_time:.4f}s")
    print(f"Eigenvalue decomposition: {eigen_time:.4f}s")

# Optimized linear algebra operations
def solve_linear_systems_efficiently():
    """Demonstrate efficient linear system solutions"""
    n = 1000

    # Symmetric positive definite system (use Cholesky)
    A_spd = np.random.rand(n, n)
    A_spd = A_spd @ A_spd.T  # Make SPD
    b = np.random.rand(n)

    # Cholesky decomposition for SPD matrices
    # (np.linalg.solve does not exploit the triangular factors;
    #  scipy.linalg.cho_factor/cho_solve would be faster still)
    start = time.time()
    L = np.linalg.cholesky(A_spd)
    y = np.linalg.solve(L, b)
    x_chol = np.linalg.solve(L.T, y)
    chol_time = time.time() - start

    # General LU decomposition
    start = time.time()
    x_lu = np.linalg.solve(A_spd, b)
    lu_time = time.time() - start

    print(f"Cholesky solve: {chol_time:.6f}s")
    print(f"LU solve: {lu_time:.6f}s")
    print(f"Speedup: {lu_time / chol_time:.1f}x")
    print(f"Solution error: {np.linalg.norm(x_chol - x_lu):.2e}")

# Matrix operations with different data types
def optimize_data_types():
    """Demonstrate performance impact of data types"""
    size = 1000

    # Different precision levels
    A_f64 = np.random.rand(size, size).astype(np.float64)
    A_f32 = A_f64.astype(np.float32)

    # Float64 operations
    start = time.time()
    result_f64 = A_f64 @ A_f64.T
    f64_time = time.time() - start

    # Float32 operations (often faster on modern hardware)
    start = time.time()
    result_f32 = A_f32 @ A_f32.T
    f32_time = time.time() - start

    print(f"Float64 computation: {f64_time:.6f}s")
    print(f"Float32 computation: {f32_time:.6f}s")
    print(f"Speedup: {f64_time / f32_time:.1f}x")
    print(f"Memory usage ratio: {A_f64.nbytes / A_f32.nbytes:.1f}x")

# Advanced BLAS operations
def advanced_blas_operations():
    """Use advanced BLAS routines directly"""
    from scipy.linalg import blas

    # Direct BLAS calls for maximum performance
    m, n, k = 1000, 800, 1200
    A = np.random.rand(m, k)
    B = np.random.rand(k, n)
    # Use GEMM directly (General Matrix Multiply); dgemm returns the product,
    # so capture the return value rather than relying on in-place output
    start = time.time()
    C = blas.dgemm(alpha=1.0, a=A, b=B)
    blas_time = time.time() - start

    # Compare with NumPy
    start = time.time()
    C_numpy = A @ B
    numpy_time = time.time() - start

    print(f"Direct BLAS: {blas_time:.6f}s")
    print(f"NumPy: {numpy_time:.6f}s")
    print(f"Results match: {np.allclose(C, C_numpy)}")
```

BLAS integration provides the foundation for high-performance scientific computing. Understanding which operations benefit from optimized libraries helps guide algorithm design decisions.
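
One practical detail: BLAS libraries manage their own thread pools, and when you already parallelize at a higher level (multiprocessing, a job scheduler) it helps to cap them to avoid oversubscription. A sketch using the optional `threadpoolctl` package (assumed installed; it is not used elsewhere in this article):

```python
import numpy as np
from threadpoolctl import threadpool_limits  # optional: pip install threadpoolctl

A = np.random.rand(1500, 1500)
B = np.random.rand(1500, 1500)

# Limit BLAS to a single thread inside this block, e.g. when each worker
# process should stay on its own core.
with threadpool_limits(limits=1):
    C_single = A @ B

C_default = A @ B                 # back to the library's default thread count
print(np.allclose(C_single, C_default))
```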

## Memory Mapping for Large Datasets

Memory mapping allows processing datasets larger than available RAM by loading portions on-demand. This technique is crucial for analyzing massive scientific datasets without memory constraints.

```python
import numpy as np
import h5py
import time
from pathlib import Path

def create_large_dataset():
    """Create a large HDF5 dataset for demonstration"""
    filename = "large_scientific_data.h5"
    n_samples = 1000000
    n_features = 100

    with h5py.File(filename, 'w') as f:
        # Create dataset with chunking and compression
        dataset = f.create_dataset('data', 
                                 shape=(n_samples, n_features),
                                 dtype=np.float32,
                                 chunks=True,
                                 compression='gzip')

        # Fill with synthetic scientific data
        for i in range(0, n_samples, 10000):
            end_idx = min(i + 10000, n_samples)
            chunk_data = np.random.randn(end_idx - i, n_features).astype(np.float32)
            dataset[i:end_idx] = chunk_data

        # Add metadata
        dataset.attrs['description'] = 'Synthetic scientific measurement data'
        dataset.attrs['units'] = 'arbitrary units'
        dataset.attrs['sampling_rate'] = 1000.0

    return filename

def memory_mapped_analysis(filename):
    """Analyze large dataset using memory mapping"""
    with h5py.File(filename, 'r') as f:
        data = f['data']
        n_samples, n_features = data.shape

        print(f"Dataset shape: {data.shape}")
        print(f"Dataset size: {data.nbytes / 1024**3:.2f} GB")

        # Compute statistics without loading entire dataset
        chunk_size = 10000
        running_sum = np.zeros(n_features)
        running_sum_sq = np.zeros(n_features)

        for i in range(0, n_samples, chunk_size):
            end_idx = min(i + chunk_size, n_samples)
            chunk = data[i:end_idx]

            running_sum += np.sum(chunk, axis=0)
            running_sum_sq += np.sum(chunk**2, axis=0)

        # Calculate mean and standard deviation
        mean = running_sum / n_samples
        variance = (running_sum_sq / n_samples) - mean**2
        std = np.sqrt(variance)

        print(f"Mean values (first 5 features): {mean[:5]}")
        print(f"Std values (first 5 features): {std[:5]}")

# NumPy memory mapping
def numpy_memmap_example():
    """Demonstrate NumPy memory mapping"""
    # Create large array on disk
    filename = "large_array.dat"
    shape = (100000, 1000)

    # Create memory-mapped array
    mmap_array = np.memmap(filename, dtype='float32', mode='w+', shape=shape)

    # Fill with data (only loads necessary pages)
    for i in range(0, shape[0], 1000):
        end_idx = min(i + 1000, shape[0])
        mmap_array[i:end_idx] = np.random.randn(end_idx - i, shape[1])

    # Perform operations on memory-mapped array
    start = time.time()
    column_means = np.mean(mmap_array, axis=0)
    operation_time = time.time() - start

    print(f"Memory-mapped operation time: {operation_time:.4f}s")
    print(f"Column means shape: {column_means.shape}")

    # Clean up
    del mmap_array
    Path(filename).unlink(missing_ok=True)

# Streaming data processing
class StreamingProcessor:
    """Process large datasets in streaming fashion"""

    def __init__(self, chunk_size=10000):
        self.chunk_size = chunk_size
        self.processed_samples = 0

    def process_chunk(self, chunk):
        """Process a single chunk of data"""
        # Example: compute chunk statistics
        chunk_stats = {
            'mean': np.mean(chunk, axis=0),
            'std': np.std(chunk, axis=0),
            'min': np.min(chunk, axis=0),
            'max': np.max(chunk, axis=0)
        }
        return chunk_stats

    def process_file(self, filename):
        """Process entire file in chunks"""
        results = []

        with h5py.File(filename, 'r') as f:
            data = f['data']
            n_samples = data.shape[0]

            for i in range(0, n_samples, self.chunk_size):
                end_idx = min(i + self.chunk_size, n_samples)
                chunk = data[i:end_idx]

                chunk_result = self.process_chunk(chunk)
                results.append(chunk_result)

                self.processed_samples += chunk.shape[0]

                if i % (self.chunk_size * 10) == 0:
                    print(f"Processed {self.processed_samples}/{n_samples} samples")

        return results

# Usage example
def demonstrate_memory_mapping():
    # Create test dataset
    filename = create_large_dataset()

    # Analyze using memory mapping
    memory_mapped_analysis(filename)

    # Process in streaming fashion
    processor = StreamingProcessor(chunk_size=50000)
    results = processor.process_file(filename)

    print(f"Processed {len(results)} chunks")

    # Clean up
    Path(filename).unlink(missing_ok=True)
```

Memory mapping transforms how we handle large-scale scientific data. Instead of being limited by RAM, we can process datasets of arbitrary size with predictable memory usage.
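
A complementary pattern is to write intermediate results to a memory-mapped file in one pass and reopen the same file read-only in later analysis steps, so no stage ever needs the full array in RAM. A minimal sketch (the filename is illustrative):

```python
import numpy as np
from pathlib import Path

filename = "intermediate_results.dat"    # illustrative scratch file
shape = (100000, 1000)

# Writer pass: mode='w+' creates the file on disk
writer = np.memmap(filename, dtype='float32', mode='w+', shape=shape)
writer[:1000] = np.random.randn(1000, shape[1]).astype(np.float32)
writer.flush()
del writer                               # release the writing view

# Reader pass: mode='r' maps the file without loading it eagerly
reader = np.memmap(filename, dtype='float32', mode='r', shape=shape)
print(f"First-block mean: {reader[:1000].mean():.4f}")

Path(filename).unlink(missing_ok=True)   # clean up the scratch file
```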

## Profile-Guided Optimization

Profiling identifies computational bottlenecks and guides optimization efforts toward the most impactful improvements. Python provides several excellent profiling tools for scientific code analysis.
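
For quick micro-benchmarks of a single expression, `timeit` is often enough before reaching for a full profiler; a small sketch:

```python
import timeit
import numpy as np

# Sketch: timeit runs a callable repeatedly and returns the total elapsed
# time, which smooths out one-off noise for short expressions.
data = np.random.rand(1_000_000)

loop_time = timeit.timeit(lambda: sum(x * x for x in data), number=3)
vec_time = timeit.timeit(lambda: float(np.dot(data, data)), number=3)

print(f"Python generator sum: {loop_time / 3:.4f}s per call")
print(f"np.dot:               {vec_time / 3:.6f}s per call")
```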


```python
import cProfile
import pstats
# line_profiler (run via kernprof) adds per-line timing when finer detail is needed
import numpy as np
from functools import wraps

def profile_function(func):
    """Decorator for profiling individual functions"""
    @wraps(func)
    def wrapper(*args, **kwargs):
        profiler = cProfile.Profile()
        profiler.enable()

        result = func(*args, **kwargs)

        profiler.disable()
        stats = pstats.Stats(profiler)
        stats.sort_stats('cumulative')
        stats.print_stats(10)  # Top 10 functions

        return result
    return wrapper

# Scientific computation example for profiling
@profile_function
def complex_scientific_computation(n=1000):
    """Example scientific computation with performance bottlenecks"""

    # Bottleneck 1: Inefficient matrix creation
    def create_matrix_slow(size):
        matrix = np.zeros((size, size))
        for i in range(size):
            for j in range(size):
                matrix[i, j] = np.sin(i) * np.cos(j)
        return matrix

    # Bottleneck 2: Redundant computations
    def redundant_computation(matrix):
        result = np.zeros_like(matrix)
        for i in range(matrix.shape[0]):
            for j in range(matrix.shape[1]):
                result[i, j] = np.exp(matrix[i, j]) + np.log(abs(matrix[i, j]) + 1)
        return result

    # Bottleneck 3: Memory allocation in loop
    def memory_inefficient_operation(data):
        results = []
        for i in range(len(data)):
            temp = np.array([data[i] * k for k in range(100)])
            results.append(np.sum(temp))
        return np.array(results)

    # Execute bottlenecked operations
    matrix = create_matrix_slow(n)
    processed = redundant_computation(matrix)
    final_result = memory_inefficient_operation(processed.flatten()[:1000])

    return final_result

# Optimized version for comparison
@profile_function
def optimized_scientific_computation(n=1000):
    """Optimized version of the same computation"""

    # Optimization 1: Vectorized matrix creation
    i_vals = np.arange(n)[:, np.newaxis]
    j_vals = np.arange(n)[np.newaxis, :]
    matrix = np.sin(i_vals) * np.cos(j_vals)

    # Optimization 2: Vectorized computation
    processed = np.exp(matrix) + np.log(np.abs(matrix) + 1)

    # Optimization 3: Pre-allocated arrays and vectorization
    data = processed.flatten()[:1000]
    k_vals = np.arange(100)
    temp_matrix = data[:, np.newaxis] * k_vals[np.newaxis, :]
    final_result = np.sum(temp_matrix, axis=1)

    return final_result

# Memory profiling
def memory_profiling_example():
    """Demonstrate memory usage profiling"""
    import tracemalloc

    # Start tracing
    tracemalloc.start()

    # Memory-intensive operations
    large_arrays = []
    for i in range(10):
        array = np.random.rand(1000, 1000)
        transformed = np.fft.fft2(array)
        large_arrays.append(transformed)

    # Get memory statistics
    current, peak = tracemalloc.get_traced_memory()
    print(f"Current memory usage: {current / 1024**2:.1f} MB")
    print(f"Peak memory usage: {peak / 1024**2:.1f} MB")

    tracemalloc.stop()
```

---
📘 **Check out my [latest ebook](https://youtu.be/WpR6F4ky4uM) for free on my channel!**
Be sure to **like**, **share**, **comment**, and **subscribe** to the channel!

---
## 101 Books

**101 Books** is an AI-driven publishing company co-founded by author **Aarav Joshi**. By leveraging advanced AI technology, we keep our publishing costs incredibly low—some books are priced as low as **$4**—making quality knowledge accessible to everyone.

Check out our book **[Golang Clean Code](https://www.amazon.com/dp/B0DQQF9K3Z)** available on Amazon. 

Stay tuned for updates and exciting news. When shopping for books, search for **Aarav Joshi** to find more of our titles. Use the provided link to enjoy **special discounts**!

## Our Creations

Be sure to check out our creations:

**[Investor Central](https://www.investorcentral.co.uk/)** | **[Investor Central Spanish](https://spanish.investorcentral.co.uk/)** | **[Investor Central German](https://german.investorcentral.co.uk/)** | **[Smart Living](https://smartliving.investorcentral.co.uk/)** | **[Epochs & Echoes](https://epochsandechoes.com/)** | **[Puzzling Mysteries](https://www.puzzlingmysteries.com/)** | **[Hindutva](http://hindutva.epochsandechoes.com/)** | **[Elite Dev](https://elitedev.in/)** | **[JS Schools](https://jsschools.com/)**

---

### We are on Medium

**[Tech Koala Insights](https://techkoalainsights.com/)** | **[Epochs & Echoes World](https://world.epochsandechoes.com/)** | **[Investor Central Medium](https://medium.investorcentral.co.uk/)** | **[Puzzling Mysteries Medium](https://medium.com/puzzling-mysteries)** | **[Science & Epochs Medium](https://science.epochsandechoes.com/)** | **[Modern Hindutva](https://modernhindutva.substack.com/)**

