Akshaya

Posted on • Originally published at akshaya-dev.hashnode.dev

Efficient Parallelism in Python: A Practical Guide to the concurrent.futures Module

[Image: modern data center processors handling heavy CPU-bound workloads and concurrent data ingestion pipelines]
Modern computing environments are built on multi-core processors, yet CPython's default execution model is effectively single-threaded because of the Global Interpreter Lock (GIL). The concurrent.futures module provides a high-level interface for distributing tasks across multiple threads or processes, achieving concurrency and, where possible, true parallelism.

Understanding the basics

  • Concurrency - The ability of a system to handle multiple tasks by interleaving them over time. The tasks may not run simultaneously, but the system switches between them efficiently so that each one makes progress.

  • Parallelism - It is the simultaneous execution of multiple tasks using multiple CPU cores.

  • Process - An instance of a running program. Each process has its own memory space & CPU resources and runs independently from other processes. Processes can communicate with each other using pipes, queues, or shared memory.

  • Thread - The smallest unit of execution within a process. When a process starts, a main thread is created automatically. Additional threads can be created to achieve task concurrency within the same process and they share the same memory space within that process. Also, threads are lightweight compared to processes.

  • Global Interpreter Lock (GIL) - A mutex (mutual-exclusion) lock in CPython that allows only one thread to execute Python bytecode at a time.

Understanding the Bottleneck: CPU-Bound vs I/O-Bound

Before choosing an executor, we must identify the nature of our task's latency.

  • I/O-Bound: The program spends most of its time waiting for external responses (Network requests, Database queries, Disk Read/Write).

  • CPU-Bound: The program spends its time performing computations on the CPU (Mathematical calculations, Image processing, Data transformations).

Executors in concurrent.futures

The concurrent.futures module provides two main executors: ThreadPoolExecutor for concurrent execution and ProcessPoolExecutor for parallel execution.

ThreadPoolExecutor

  • The ThreadPoolExecutor manages a pool of worker threads and distributes tasks among them so multiple operations can run concurrently.

  • It is suitable for I/O-bound tasks because when one thread waits for a network or I/O response, the GIL is released, allowing another thread to start its request.

  • Example: Data ingestion from multiple APIs

    • max_workers: We set max_workers to 4, so the ThreadPoolExecutor uses at most 4 threads to run the ingestion tasks concurrently. In our case, each thread handles one of the four APIs.
    • executor.map(): The concurrent counterpart of the built-in map() function. It applies a given function to every item of an iterable, executing the calls concurrently on the pool's worker threads and returning the results in input order.
from concurrent.futures import ThreadPoolExecutor
import time

def ingest_source(source_name):
    print(f"Initiating {source_name}")
    # data ingestion and cloud storage happen here; simulate the network wait
    time.sleep(2)
    return f"{source_name} ingested successfully"

def concurrent_ingest(sources):
    with ThreadPoolExecutor(max_workers=4) as executor:
        results = executor.map(ingest_source, sources)
    return list(results)

if __name__ == '__main__':
    api_sources = ["customer_api", "orders_api", "inventory_api", "payments_api"]

    print("Initiating Data Ingestion")
    final_results = concurrent_ingest(api_sources)

    print(f"Data Ingestion Complete: {final_results}")
  • Expected Output: Since this is an I/O-bound task, we will see all four "Initiating" messages appear almost instantly. The total execution time would be determined by the single slowest API call (e.g., ~2 seconds), rather than the sum of all four calls (8+ seconds).
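executor.map() returns results in input order, so a slow first API delays everything behind it in the output. When you want each result as soon as it finishes, executor.submit() combined with as_completed() is an alternative. A sketch, reusing an ingest_source stand-in with a simulated wait:

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
import time

def ingest_source(source_name):
    time.sleep(0.1)  # stand-in for the real network wait
    return f"{source_name} ingested successfully"

sources = ["customer_api", "orders_api", "inventory_api", "payments_api"]
completed = []

with ThreadPoolExecutor(max_workers=4) as executor:
    # submit() returns a Future immediately; as_completed() yields each
    # future as soon as it finishes, regardless of submission order
    futures = [executor.submit(ingest_source, s) for s in sources]
    for future in as_completed(futures):
        completed.append(future.result())
        print(completed[-1])
```

This pattern also makes per-task error handling straightforward: future.result() re-raises any exception that occurred inside that task.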

ProcessPoolExecutor

  • This creates entirely separate instances (processes) of the Python interpreter, each with its own memory space and its own GIL.

  • It is suitable for CPU-bound tasks because separate processes allow the program to utilize multiple CPU cores and achieve true parallelism.

  • Note: It can incur high memory overhead and serialization costs to move data between processes.

  • Example scenario: Calculating the sum of factorials for the input number

from concurrent.futures import ProcessPoolExecutor
import time
import math
import sys

# Allow str() conversion of very large integers (Python 3.11+ caps this by default)
sys.set_int_max_str_digits(10000)


def factorial_sum(number):
    # Heavy math: sum of the factorials of 1 .. number-1
    result = sum(math.factorial(i) for i in range(1, number))
    return f"Input {number} -> Result length: {len(str(result))} digits"


def concurrent_transform(data_list):
    with ProcessPoolExecutor(max_workers=2) as executor:
        results = executor.map(factorial_sum, data_list)
    return list(results)


if __name__ == '__main__':
    data_shards = [2500, 2600, 2700, 2800]

    # Sequential baseline
    start_time = time.perf_counter()
    final_results = [factorial_sum(num) for num in data_shards]
    duration = time.perf_counter() - start_time
    print(f"Time taken (without ProcessPoolExecutor): {duration:.2f} seconds\n")
    for res in final_results:
        print(res)

    # Parallel run across two worker processes
    process_pool_start_time = time.perf_counter()
    process_pool_final_results = concurrent_transform(data_shards)
    process_pool_duration = time.perf_counter() - process_pool_start_time
    print(f"Time taken (with ProcessPoolExecutor): {process_pool_duration:.2f} seconds\n")
    for res in process_pool_final_results:
        print(res)
Output:

Time taken (without ProcessPoolExecutor): 1.00 seconds

Input 2500 -> Result length: 7408 digits
Input 2600 -> Result length: 7749 digits
Input 2700 -> Result length: 8091 digits
Input 2800 -> Result length: 8435 digits

Time taken (with ProcessPoolExecutor): 0.62 seconds

Input 2500 -> Result length: 7408 digits
Input 2600 -> Result length: 7749 digits
Input 2700 -> Result length: 8091 digits
Input 2800 -> Result length: 8435 digits

Comparison Table

| Feature | ThreadPoolExecutor | ProcessPoolExecutor |
| --- | --- | --- |
| Task Type | I/O-bound | CPU-bound |
| Parallelism | No (concurrent only) | Yes (true parallelism) |
| Memory Usage | Low | High |
| Data Sharing | Easy (shared memory) | Hard (serialization) |
| Python Version | 3.2+ | 3.2+ |

Things to note

  • InterpreterPoolExecutor: Introduced in Python v3.14, it is a subclass of ThreadPoolExecutor where each worker thread runs in its own isolated Python interpreter. This allows for true multi-core parallelism (multiple GILs) while avoiding the heavy memory overhead and slow startup times associated with creating entirely new processes.

  • The __main__ Guard: When using ProcessPoolExecutor, you must wrap your entry point in if __name__ == "__main__":. On platforms that start workers by re-importing the main module (the "spawn" start method), omitting the guard causes sub-processes to be created recursively.

  • The with statement: Always use the with statement. It guarantees the executor shuts down and its workers are cleaned up once the block completes, even if an exception is raised.

  • Worker Limits: For CPU-bound tasks, setting max_workers significantly higher than your CPU core count will decrease performance due to context-switching overhead.

Conclusion

Understanding when to use threads versus processes is essential for writing efficient Python applications. By identifying whether a task is I/O-bound or CPU-bound, we can choose the appropriate executor and significantly improve performance.
