
Modern computing environments are built on multi-core processors, yet Python's default execution model is effectively single-threaded because of the Global Interpreter Lock (GIL). To achieve concurrency and parallelism, the `concurrent.futures` module can be used to distribute tasks across multiple threads or processes.
Understanding the basics
Concurrency - The ability of a system to make progress on multiple tasks over time. The tasks may not run simultaneously, but the system switches between them efficiently so that each task makes progress.
Parallelism - The simultaneous execution of multiple tasks on multiple CPU cores.
Process - An instance of a running program. Each process has its own memory space and CPU resources and runs independently of other processes. Processes can communicate with each other using pipes, queues, or shared memory.
Thread - The smallest unit of execution within a process. When a process starts, a main thread is created automatically. Additional threads can be created to achieve concurrency within the same process; they share that process's memory space and are lightweight compared to processes.
Global Interpreter Lock (GIL) - A mutex (mutual exclusion lock) in CPython that allows only one thread to execute Python bytecode at a time.
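The GIL's effect is easy to observe: running a CPU-bound function in two threads takes about as long as running it twice sequentially on standard CPython. A minimal sketch (the function and iteration count are illustrative, not from any library):

```python
import threading
import time

def count_down(n):
    # Pure-Python busy loop: CPU-bound, never releases the GIL voluntarily
    while n > 0:
        n -= 1

N = 2_000_000

# Sequential: two calls, one after the other
start = time.perf_counter()
count_down(N)
count_down(N)
sequential = time.perf_counter() - start

# Threaded: two threads, but the GIL lets only one execute bytecode at a time
start = time.perf_counter()
t1 = threading.Thread(target=count_down, args=(N,))
t2 = threading.Thread(target=count_down, args=(N,))
t1.start(); t2.start()
t1.join(); t2.join()
threaded = time.perf_counter() - start

print(f"Sequential: {sequential:.2f}s, Threaded: {threaded:.2f}s")
```

On a standard (GIL-enabled) CPython build, the threaded run is typically no faster than the sequential one, and often slightly slower due to lock contention.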
Understanding the Bottleneck: CPU-Bound vs I/O-Bound
Before choosing an executor, we must identify the nature of our task's latency.
I/O-Bound: The program spends most of its time waiting for external responses (Network requests, Database queries, Disk Read/Write).
CPU-Bound: The program spends its time performing computations on the CPU (Mathematical calculations, Image processing, Data transformations).
Executors in concurrent.futures
The `concurrent.futures` module provides two primary executors: `ThreadPoolExecutor` for concurrency and `ProcessPoolExecutor` for parallelism.
ThreadPoolExecutor
The `ThreadPoolExecutor` manages a pool of worker threads and distributes tasks among them so multiple operations can run concurrently. It is suitable for I/O-bound tasks: while one thread waits for a network or disk response, the GIL is released, allowing another thread to start its request.
Example: Data ingestion from multiple APIs

- `max_workers`: We set `max_workers` to 4, so the `ThreadPoolExecutor` uses at most 4 threads to perform the data ingestion concurrently. In our case, each thread is assigned to one of the four APIs.
- `executor.map()`: The concurrent version of the built-in `map()` function. It applies a given function to all items in an iterable, executing the calls concurrently using the pool of worker threads and returning the results in input order.
```python
from concurrent.futures import ThreadPoolExecutor
import time
import random

def ingest_source(source_name):
    print(f"Initiating ingestion for {source_name}")
    # Simulate a slow network call; real data ingestion and cloud storage
    # would happen here
    time.sleep(random.uniform(1, 2))
    return f"{source_name} ingested successfully"

def concurrent_ingest(sources):
    with ThreadPoolExecutor(max_workers=4) as executor:
        results = executor.map(ingest_source, sources)
        return list(results)

if __name__ == '__main__':
    api_sources = ["customer_api", "orders_api", "inventory_api", "payments_api"]
    print("Initiating Data Ingestion")
    final_results = concurrent_ingest(api_sources)
    print(f"Data Ingestion Complete: {final_results}")
```
- Expected Output: Since this is an I/O-bound task, all four ingestion calls start almost immediately. The total execution time is determined by the single slowest API call (e.g., ~2 seconds), rather than the sum of all four calls (8+ seconds).
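`executor.map()` yields results in input order. When you would rather handle results as soon as each task finishes, `concurrent.futures` also provides `submit()` and `as_completed()`. A sketch of the same ingestion flow (the function and source names are redefined here for self-containment):

```python
from concurrent.futures import ThreadPoolExecutor, as_completed
import time
import random

def ingest_source(source_name):
    time.sleep(random.uniform(0.1, 0.3))  # simulate a network call
    return f"{source_name} ingested successfully"

sources = ["customer_api", "orders_api", "inventory_api", "payments_api"]

with ThreadPoolExecutor(max_workers=4) as executor:
    # submit() returns a Future immediately; as_completed() yields each
    # Future as soon as it finishes, regardless of submission order.
    futures = [executor.submit(ingest_source, s) for s in sources]
    for future in as_completed(futures):
        print(future.result())
```

This is useful when later processing (logging, writing to storage) should begin the moment the first response arrives.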
ProcessPoolExecutor
This creates entirely separate instances (processes) of the Python interpreter, each with its own memory space and its own GIL.
It is suitable for CPU-bound tasks because separate processes allow the program to utilize multiple CPU cores and achieve true parallelism.
Note: It can incur high memory overhead and serialization costs to move data between processes.
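The serialization cost exists because arguments and return values cross process boundaries as pickled bytes. A rough way to gauge that overhead (the payload here is illustrative):

```python
import pickle

# Arguments sent to a ProcessPoolExecutor worker are pickled, shipped
# over a pipe, and unpickled in the child process.
payload = list(range(1_000_000))
data = pickle.dumps(payload)
print(f"Serialized size: {len(data) / 1_000_000:.1f} MB")

# Round-tripping confirms the data survives, but the copy is the cost:
# each worker receives its own deserialized copy, not shared memory.
restored = pickle.loads(data)
assert restored == payload
```

This is why `ProcessPoolExecutor` pays off for compute-heavy tasks with small inputs and outputs, and poorly for tasks that shuttle large objects back and forth.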
Example scenario: Calculating the sum of factorials for each input number
```python
from concurrent.futures import ProcessPoolExecutor
import time
import math
import sys

# Allow printing very large integers (Python 3.11+ limits int-to-str conversion)
sys.set_int_max_str_digits(10000)

def factorial_sum(number):
    # Heavy math: sum of factorials from 1! to (number - 1)!
    result = sum(math.factorial(i) for i in range(1, number))
    return f"Input {number} -> Result length: {len(str(result))} digits"

def concurrent_transform(data_list):
    with ProcessPoolExecutor(max_workers=2) as executor:
        results = executor.map(factorial_sum, data_list)
        return list(results)

if __name__ == '__main__':
    data_shards = [2500, 2600, 2700, 2800]

    start_time = time.perf_counter()
    final_results = [factorial_sum(num) for num in data_shards]
    duration = time.perf_counter() - start_time
    print(f"Time taken (without ProcessPoolExecutor): {duration:.2f} seconds\n")
    for res in final_results:
        print(res)

    process_pool_start_time = time.perf_counter()
    process_pool_final_results = concurrent_transform(data_shards)
    process_pool_duration = time.perf_counter() - process_pool_start_time
    print(f"Time taken (with ProcessPoolExecutor): {process_pool_duration:.2f} seconds\n")
    for res in process_pool_final_results:
        print(res)
```
Output:

```
Time taken (without ProcessPoolExecutor): 1.00 seconds

Input 2500 -> Result length: 7408 digits
Input 2600 -> Result length: 7749 digits
Input 2700 -> Result length: 8091 digits
Input 2800 -> Result length: 8435 digits
Time taken (with ProcessPoolExecutor): 0.62 seconds

Input 2500 -> Result length: 7408 digits
Input 2600 -> Result length: 7749 digits
Input 2700 -> Result length: 8091 digits
Input 2800 -> Result length: 8435 digits
```
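One practical detail: with `executor.map()`, a single failing task raises when its result is consumed, which can discard the remaining results. With `submit()`, each `Future` records its own outcome, so failures can be inspected individually. A sketch (the validation rule is made up for illustration):

```python
from concurrent.futures import ProcessPoolExecutor
import math

def safe_factorial(n):
    # Task functions for ProcessPoolExecutor must be defined at module
    # top level so they can be pickled and sent to worker processes.
    if n < 0:
        raise ValueError(f"negative input: {n}")
    return math.factorial(n)

if __name__ == '__main__':
    inputs = [5, -3, 10]
    with ProcessPoolExecutor(max_workers=2) as executor:
        futures = [executor.submit(safe_factorial, n) for n in inputs]
        for n, future in zip(inputs, futures):
            # future.exception() returns the raised exception (or None),
            # so we can inspect failures without a try/except per result
            err = future.exception()
            if err is not None:
                print(f"{n}: failed ({err})")
            else:
                print(f"{n}: {future.result()}")
```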
Comparison Table
| Feature | ThreadPoolExecutor | ProcessPoolExecutor |
| --- | --- | --- |
| Task Type | I/O-bound | CPU-bound |
| Parallelism | No (concurrent only) | Yes (true parallelism) |
| Memory Usage | Low | High |
| Data Sharing | Easy (shared memory) | Hard (serialization) |
| Python Version | 3.2+ | 3.2+ |
Things to note
- `InterpreterPoolExecutor`: Introduced in Python 3.14, it is a subclass of `ThreadPoolExecutor` where each worker thread runs in its own isolated Python interpreter. This allows true multi-core parallelism (one GIL per interpreter) while avoiding the heavy memory overhead and slow startup associated with creating entirely new processes.
- The `__main__` guard: When using `ProcessPoolExecutor`, wrap your entry point in `if __name__ == "__main__":`. On platforms that start child processes by re-importing the main module, omitting the guard leads to runaway recursive process creation.
- The `with` statement: Always use the `with` statement. It ensures the pool's workers are cleaned up once the block completes.
- Worker limits: For CPU-bound tasks, setting `max_workers` significantly higher than your CPU core count will decrease performance due to context-switching overhead.
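A common sizing heuristic follows from the last point (a sketch, not a universal rule): cap CPU-bound pools at the core count, and allow more workers for I/O-bound pools, since waiting threads cost little CPU.

```python
from concurrent.futures import ProcessPoolExecutor, ThreadPoolExecutor
import os

cores = os.cpu_count() or 1  # os.cpu_count() can return None

# CPU-bound: one worker per core avoids context-switching overhead
cpu_pool = ProcessPoolExecutor(max_workers=cores)

# I/O-bound: threads mostly wait, so oversubscribing is usually fine.
# (The 5 * cores figure mirrors ThreadPoolExecutor's own default cap in
# older Python versions; tune it for your workload.)
io_pool = ThreadPoolExecutor(max_workers=min(32, 5 * cores))

cpu_pool.shutdown()
io_pool.shutdown()
```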
Conclusion
Understanding when to use threads versus processes is essential for writing efficient Python applications. By identifying whether a task is I/O-bound or CPU-bound, we can choose the appropriate executor and significantly improve performance.