Stephen Emmanuel

Originally published at blog.stephcrown.com

Vectorization in Python for Machine Learning

Introduction

Imagine you need to double every number in a list of 1000 values. One approach is to take the first number, multiply it by 2, write down the result, then move to the second number, multiply it by 2, write it down, and repeat 998 more times. Another approach is to use a spreadsheet where you can select all 1000 cells at once, apply a "multiply by 2" formula to the entire selection, and watch all results appear simultaneously.

The first method processes each number individually. The second handles the whole collection in one operation. This second approach captures the essence of vectorization in machine learning.

Most machine learning tasks involve repetitive calculations on large amounts of data. Without vectorization, a program might take several minutes to train a simple model. With vectorization, the same task completes in seconds. The mathematical results stay identical, but the execution becomes much faster.

The concept of vectorization is simple: instead of processing data points one at a time, you work with entire collections of data simultaneously.

The Problem With Loops

With loops, the computer performs computations one after the other on each data point, and depending on the dataset size, this creates significant performance delays. The asymptotic cost is still O(n) for basic operations, the same as a vectorized equivalent, but every iteration carries per-element overhead, so the constant factor is far larger; space usage also varies based on how you store intermediate results. When loops contain complex mathematical operations, as commonly occurs in machine learning algorithms, the performance degradation becomes more pronounced.

This problem is amplified by Python's inherent loop performance characteristics compared to compiled languages like C++ or Rust. Python processes loops more slowly because it operates as an interpreted language, checking variable types at runtime for each iteration and executing bytecode rather than native machine instructions. Compiled languages like C++ convert code to optimized machine instructions beforehand, eliminating much of this per-iteration overhead.

NumPy Vectorization

NumPy is a Python library that provides functions to perform mathematical operations on large, multidimensional arrays and matrices. Under the hood, most NumPy functions are written in C, which contributes significantly to their speed. Beyond the C implementation, NumPy achieves performance through several mechanisms: contiguous memory storage that optimizes CPU cache usage, elimination of Python's dynamic typing overhead during computations, and access to low-level processor instructions like SIMD that can perform identical operations on multiple data points simultaneously.

All data points in NumPy arrays (called ndarray or n-dimensional array) must be of the same data type. This requirement exists because C, the underlying language upon which most NumPy functionality is built, requires homogeneous data types for efficient memory allocation and arithmetic operations. When all elements share the same type, the system can predict exactly how much memory each element needs and where to find it, enabling faster processing.
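You can see the effect of this requirement by inspecting the dtype attribute of arrays built from different inputs. The snippet below is a minimal illustration; the exact default integer type (for example, int64 versus int32) depends on your platform.

import numpy as np

# A list of integers becomes a single integer dtype
ints = np.array([1, 2, 3])
print(ints.dtype)      # e.g. int64

# Mixing an integer with a float upcasts everything to float64
floats = np.array([1, 2.5, 3])
print(floats.dtype)    # float64

# Mixing numbers with a string forces a string dtype, and
# vectorized arithmetic on it loses the usual speed advantage
mixed = np.array([1, 2, "three"])
print(mixed.dtype)     # e.g. <U21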

NumPy handles array operations by treating entire arrays as single mathematical objects. Instead of Python iterating through individual elements, NumPy passes the entire computation to optimized C code that still uses loops internally, but these C loops benefit from compiler optimizations and can make use of CPU-level instructions that Python cannot access. The library also uses broadcasting, a technique that allows operations between arrays of different shapes without explicitly creating larger arrays in memory.
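As a brief sketch of what broadcasting looks like in practice (the values here are arbitrary examples), a scalar or a smaller array can be combined with a larger array without writing a loop or materializing extra copies:

import numpy as np

matrix = np.array([[1, 2, 3],
                   [4, 5, 6]])   # shape (2, 3)
row = np.array([10, 20, 30])     # shape (3,)

# The scalar 2 is applied to every element of the matrix
print(matrix * 2)

# The 1-D row is stretched across both rows of the matrix,
# without an explicit (2, 3) copy being created first
print(matrix + row)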

Let us examine some practical examples that demonstrate the performance differences between loop-based and vectorized approaches. These examples cover common mathematical operations in machine learning, showing both the computational formulas and timing comparisons to illustrate vectorization benefits.

Example 1: Element-wise Multiplication

Element-wise multiplication pairs corresponding elements from two arrays of equal size and multiplies them together. This operation appears frequently in machine learning when applying feature weights or performing data transformations across datasets.

Mathematical Formula: Given two vectors a = [a₁, a₂, ..., aₙ] and b = [b₁, b₂, ..., bₙ], the element-wise multiplication produces: c = a ⊙ b = [a₁×b₁, a₂×b₂, ..., aₙ×bₙ]. The Python implementation is as follows:

import numpy as np
import time

# Create large arrays
size = 1000000
a = list(range(size))
b = list(range(size, 2*size))

# Non-vectorized approach
start_time = time.time()
result_loop = []
for i in range(len(a)):
    result_loop.append(a[i] * b[i])
loop_time = time.time() - start_time

# Vectorized approach
a_np = np.array(a)
b_np = np.array(b)
start_time = time.time()
result_vectorized = a_np * b_np
vectorized_time = time.time() - start_time

print(f"Loop time: {loop_time:.4f} seconds")
print(f"Vectorized time: {vectorized_time:.4f} seconds")
print(f"Speedup: {loop_time/vectorized_time:.1f}x faster")

In this example, we used time.time() to measure execution duration for both approaches. In the non-vectorized method, we used an explicit loop to process each element individually, appending results to a new list. For the vectorized approach, we simply converted our Python lists to NumPy arrays using np.array() and applied the multiplication operator (*) directly to the entire arrays.

As we can see from the printed output below, the vectorized approach achieved a 62x performance improvement. While your specific numbers may differ based on your processor speed, available memory, system load, and array size, the vectorized approach will consistently outperform the loop-based method by significant margins.

Loop time: 0.2971 seconds
Vectorized time: 0.0048 seconds
Speedup: 62.0x faster

Example 2: Dot Product

The dot product multiplies corresponding elements from two vectors and sums the results to produce a single scalar value. This calculation forms the foundation of linear regression predictions and neural network operations in machine learning.

Mathematical Formula: Given two vectors a = [a₁, a₂, ..., aₙ] and b = [b₁, b₂, ..., bₙ], the dot product is: a · b = a₁×b₁ + a₂×b₂ + ... + aₙ×bₙ = Σᵢ₌₁ⁿ (aᵢ × bᵢ).

The Python implementation is as follows:

import numpy as np
import time

# Computing dot product of two vectors
size = 100000
a = list(range(size))
b = list(range(1, size + 1))

# Non-vectorized approach
start_time = time.time()
dot_product_loop = 0
for i in range(len(a)):
    dot_product_loop += a[i] * b[i]
loop_time = time.time() - start_time

# Vectorized approach
a_np = np.array(a)
b_np = np.array(b)
start_time = time.time()
dot_product_vectorized = np.dot(a_np, b_np)
vectorized_time = time.time() - start_time

print(f"Loop time: {loop_time:.4f} seconds")
print(f"Vectorized time: {vectorized_time:.4f} seconds")
print(f"Speedup: {loop_time/vectorized_time:.1f}x faster")

Here, we use NumPy's np.dot() function to compute the dot product of the two NumPy arrays. As in the previous example, we measure the timing difference between loop-based accumulation and NumPy's built-in function, and as the output below shows, the vectorized version achieved roughly a 100x speedup over manual Python iteration and accumulation.

Loop time: 0.0204 seconds
Vectorized time: 0.0002 seconds
Speedup: 101.8x faster
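Because the dot product is the core of linear regression predictions, it is worth seeing the vectorized form of that calculation as well. The sketch below is illustrative only: the feature matrix, weights, and bias are made-up values, and X @ w is the matrix-vector generalization of np.dot, predicting for every sample in one operation.

import numpy as np

# Hypothetical dataset: 4 samples with 3 features each
X = np.array([[1.0, 2.0, 3.0],
              [4.0, 5.0, 6.0],
              [7.0, 8.0, 9.0],
              [1.5, 2.5, 3.5]])

w = np.array([0.2, 0.5, 0.3])   # example feature weights
b = 0.1                         # example bias term

# One matrix-vector product replaces a loop over samples
predictions = X @ w + b
print(predictions)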

Pandas Vectorization

Pandas extends NumPy's vectorization capabilities to labeled data structures like DataFrames and Series. Unlike NumPy's homogeneous arrays, pandas handles mixed data types and missing values, and it provides intuitive operations on structured datasets. Pandas vectorization excels in data cleaning, transformation, and analysis tasks where you work with real-world datasets containing different column types and potential missing values.

Pandas achieves vectorization through built-in methods that operate on entire columns or DataFrames simultaneously. The library automatically handles data alignment, missing value propagation, and type conversions during vectorized operations. This makes pandas ideal for exploratory data analysis and data preprocessing tasks where you work with tabular data.
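A small sketch of what that alignment and missing-value handling looks like (the labels and values here are invented for illustration):

import pandas as pd
import numpy as np

s1 = pd.Series([10, 20, 30], index=['a', 'b', 'c'])
s2 = pd.Series([1, 2], index=['b', 'c'])

# Operations align on index labels; 'a' has no partner in s2,
# so its result is NaN instead of raising an error
print(s1 + s2)

# Missing values propagate through vectorized arithmetic
prices = pd.Series([2.5, np.nan, 4.0])
print(prices * 3)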

Let us examine practical examples that demonstrate pandas vectorization compared to traditional loop approaches for common data manipulation tasks:

Example 1: Element-wise Operations Between Columns

This example demonstrates multiplying corresponding values from two columns in a dataset. This operation commonly appears when calculating totals from quantity and price columns, applying weights to scores, or computing products of related features in data analysis.

Mathematical Formula: Given two columns x₁ = [x₁₁, x₁₂, ..., x₁ₙ] and x₂ = [x₂₁, x₂₂, ..., x₂ₙ], the element-wise multiplication produces: result = [x₁₁ × x₂₁, x₁₂ × x₂₂, ..., x₁ₙ × x₂ₙ].

Here’s the Python implementation:

import pandas as pd
import numpy as np
import time

# Create sample data as basic Python structures
size = 500000
values = np.random.randint(1, 100, size).tolist()
multipliers = np.random.randint(2, 5, size).tolist()

# Non-vectorized approach (working with lists)
start_time = time.time()
results_loop = []
for i in range(len(values)):
    results_loop.append(values[i] * multipliers[i])
loop_time = time.time() - start_time

# Vectorized approach (convert to pandas)
start_time = time.time()
df = pd.DataFrame({'values': values, 'multipliers': multipliers})
df['results'] = df['values'] * df['multipliers']
vectorized_time = time.time() - start_time

print(f"Loop time: {loop_time:.4f} seconds")
print(f"Vectorized time: {vectorized_time:.4f} seconds")
print(f"Speedup: {loop_time/vectorized_time:.1f}x faster")

Here, we created two Python lists for values and multipliers, which are the two features in our dataset. For the non-vectorized approach, we manually iterated through both lists and performed individual multiplications. For the vectorized approach, we created a pandas DataFrame with both columns and used the multiplication operator between entire columns with df['values'] * df['multipliers']. As is visible from the output below, the vectorized approach outperforms the loop-based approach.

Loop time: 0.1331 seconds
Vectorized time: 0.0034 seconds
Speedup: 38.7x faster

Example 2: Conditional Operations and Data Categorization

This example demonstrates applying conditional logic to categorize data based on value ranges, similar to creating grade assignments from numerical scores. Such operations frequently occur in data analysis when creating categorical variables, segmenting customers, or applying business rules across datasets. Here’s the Python implementation:

import pandas as pd
import numpy as np
import time

# Raw data as a Python list of scores
size = 500000
scores = np.random.randint(0, 100, size).tolist()

# Non-vectorized approach (Python lists and loops)
start_time = time.time()
grades_loop = []
for score in scores:
    if score >= 90:
        grades_loop.append('Excellent')
    elif score >= 70:
        grades_loop.append('Good')
    else:
        grades_loop.append('Fair')
loop_time = time.time() - start_time

# Vectorized approach (convert to pandas and use vectorized operations)
start_time = time.time()
df = pd.DataFrame({'score': scores})
df['grade'] = pd.cut(df['score'],
                     bins=[0, 70, 90, 100],
                     labels=['Fair', 'Good', 'Excellent'],
                     right=False)  # left-inclusive bins match the loop's >= thresholds
vectorized_time = time.time() - start_time

print(f"Loop time: {loop_time:.4f} seconds")
print(f"Vectorized time: {vectorized_time:.4f} seconds")
print(f"Speedup: {loop_time/vectorized_time:.1f}x faster")

In the loop example, we use an if-elif-else chain to assign each score a text grade, which we store in another list. With vectorization, we use pandas' pd.cut() function with bins at 70 and 90 (left-inclusive, so the boundaries match the loop's >= comparisons) to do the same thing, applying the conditional logic to the entire column at once. As in the previous example, we start with a Python list and convert it to a pandas DataFrame to make use of vectorized operations, eliminating the need for explicit if-elif-else statements inside a loop. As usual, the vectorized solution is faster:

Loop time: 0.1327 seconds
Vectorized time: 0.0146 seconds
Speedup: 9.1x faster
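pd.cut() is a natural fit for range-based bins; for conditions that are not simple ranges, np.select() provides another vectorized way to express the same if-elif-else logic. The snippet below reuses the scores list from the example above and should produce the same grades; treat it as an alternative sketch rather than a replacement for pd.cut().

# Same thresholds expressed with np.select instead of pd.cut
scores_np = np.array(scores)
conditions = [scores_np >= 90, scores_np >= 70]   # checked in order, first match wins
choices = ['Excellent', 'Good']
grades = np.select(conditions, choices, default='Fair')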

Conclusion

We've seen how vectorization changes the way we handle data processing in machine learning. Instead of writing loops that process each data point individually, we can apply operations to entire datasets simultaneously. This approach showed significant speed improvements.

We used NumPy for mathematical operations on homogeneous numerical data, and we used pandas for structured data manipulation, efficiently handling operations between columns and applying conditional logic across entire datasets.

Moving from individual data point processing to whole-collection operations requires a different way of thinking. This approach becomes more natural with practice and produces faster, cleaner code.

Your next step involves applying these techniques to your own datasets. Try converting existing loops to vectorized operations using the patterns we explored. Experiment with NumPy's broadcasting and pandas' built-in functions. Practice with vectorization will make these techniques more familiar in your machine learning work.
