<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: AZ</title>
    <description>The latest articles on DEV Community by AZ (@tjiaz).</description>
    <link>https://dev.to/tjiaz</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F1135813%2F66de465e-4c4d-4101-8712-49de1c836c63.jpeg</url>
      <title>DEV Community: AZ</title>
      <link>https://dev.to/tjiaz</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/tjiaz"/>
    <language>en</language>
    <item>
      <title>Speeding Up Your Python Programs with Concurrency</title>
      <dc:creator>AZ</dc:creator>
      <pubDate>Wed, 01 Apr 2026 22:18:46 +0000</pubDate>
      <link>https://dev.to/tjiaz/speeding-up-your-python-programs-with-concurrency-3m2h</link>
      <guid>https://dev.to/tjiaz/speeding-up-your-python-programs-with-concurrency-3m2h</guid>
      <description>&lt;p&gt;What Is Concurrency?&lt;/p&gt;

&lt;p&gt;At its core, concurrency means a program can juggle multiple sequences of work. In Python, these sequences go by different names — threads, tasks, and processes — but they all share the same basic idea: each one represents a line of execution that can be paused and resumed.&lt;/p&gt;

&lt;p&gt;The important distinction is that threads and asynchronous tasks run on a single processor, switching between each other cleverly rather than truly running side by side. Processes, on the other hand, can run on separate CPU cores simultaneously — that's true parallelism.&lt;/p&gt;

&lt;p&gt;Python offers three main tools for concurrency:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;&lt;code&gt;threading&lt;/code&gt; — pre-emptive multitasking on a single core&lt;/li&gt;
&lt;li&gt;&lt;code&gt;asyncio&lt;/code&gt; — cooperative multitasking on a single core&lt;/li&gt;
&lt;li&gt;&lt;code&gt;multiprocessing&lt;/code&gt; — true parallelism across multiple cores&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I/O-Bound vs CPU-Bound Problems&lt;/p&gt;

&lt;p&gt;Before choosing a concurrency approach, it's important to understand what kind of problem you're solving.&lt;/p&gt;

&lt;p&gt;I/O-bound problems are those where your program spends most of its time waiting — for a network response, a file to load, or a database query to return. The CPU sits idle during these waits, so overlapping them with other work can bring big gains.&lt;/p&gt;

&lt;p&gt;CPU-bound problems are the opposite — the bottleneck is the processor itself, working flat out on heavy computation like number crunching or data processing.&lt;/p&gt;

&lt;p&gt;Speeding Up I/O-Bound Tasks&lt;/p&gt;

&lt;p&gt;The Synchronous Approach&lt;/p&gt;

&lt;p&gt;A simple synchronous script downloads web pages one by one, waiting for each to finish before moving on. It's easy to write and understand, but slow — all that waiting time is wasted.&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# sync_downloader.py
import time
import requests

def fetch_page(url, session):
    with session.get(url) as resp:
        print(f"Fetched {len(resp.content)} bytes from {url}")

def fetch_all(urls):
    with requests.Session() as session:
        for url in urls:
            fetch_page(url, session)

def main():
    targets = [
        "https://www.jython.org",
        "http://olympus.realpython.org/dice",
    ] * 80
    t0 = time.perf_counter()
    fetch_all(targets)
    elapsed = time.perf_counter() - t0
    print(f"Fetched {len(targets)} pages in {elapsed:.2f}s")

if __name__ == "__main__":
    main()
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;On a typical connection, this might take around 14 seconds for 160 pages. Fine for a one-off script, but painful if you run it regularly.&lt;/p&gt;

&lt;p&gt;Using Threads&lt;/p&gt;

&lt;p&gt;The ThreadPoolExecutor from concurrent.futures makes it easy to distribute work across multiple threads. With a small pool, several downloads can run at the same time.&lt;/p&gt;

&lt;p&gt;One key requirement: requests.Session is not thread-safe, so each thread must maintain its own session using threading.local().&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# threaded_downloader.py
import threading
import time
from concurrent.futures import ThreadPoolExecutor
import requests

_thread_local = threading.local()

def get_session():
    if not hasattr(_thread_local, "session"):
        _thread_local.session = requests.Session()
    return _thread_local.session

def fetch_page(url):
    session = get_session()
    with session.get(url) as resp:
        print(f"Fetched {len(resp.content)} bytes from {url}")

def fetch_all(urls):
    with ThreadPoolExecutor(max_workers=5) as pool:
        pool.map(fetch_page, urls)

def main():
    targets = [
        "https://www.jython.org",
        "http://olympus.realpython.org/dice",
    ] * 80
    t0 = time.perf_counter()
    fetch_all(targets)
    elapsed = time.perf_counter() - t0
    print(f"Fetched {len(targets)} pages in {elapsed:.2f}s")

if __name__ == "__main__":
    main()
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This typically completes in around 3 seconds — roughly 4–5x faster than the synchronous version. The thread pool keeps a fixed number of workers alive and reuses them across all tasks, avoiding the overhead of creating a fresh thread for every request.&lt;/p&gt;

&lt;p&gt;Using asyncio&lt;/p&gt;

&lt;p&gt;The asyncio approach uses a single-threaded event loop with coroutines — lightweight, pausable functions. When a coroutine hits an await, it hands control back to the loop, which immediately starts another task. You'll need aiohttp instead of requests since the latter doesn't support async:&lt;/p&gt;

&lt;p&gt;&lt;code&gt;pip install aiohttp&lt;/code&gt;&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# async_downloader.py
import asyncio
import time
import aiohttp

async def fetch_page(url, session):
    async with session.get(url) as resp:
        data = await resp.read()
        print(f"Fetched {len(data)} bytes from {url}")

async def fetch_all(urls):
    async with aiohttp.ClientSession() as session:
        jobs = [fetch_page(url, session) for url in urls]
        await asyncio.gather(*jobs, return_exceptions=True)

async def main():
    targets = [
        "https://www.jython.org",
        "http://olympus.realpython.org/dice",
    ] * 80
    t0 = time.perf_counter()
    await fetch_all(targets)
    elapsed = time.perf_counter() - t0
    print(f"Fetched {len(targets)} pages in {elapsed:.2f}s")

if __name__ == "__main__":
    asyncio.run(main())
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;This version can finish in under 0.5 seconds — over 30x faster than the synchronous version and noticeably quicker than threads. Because all tasks share a single session on one thread, there are no thread-safety concerns. The trade-off is that any blocking call inside a coroutine will stall the entire event loop.&lt;/p&gt;
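&lt;p&gt;To see that trade-off in isolation, here is a minimal sketch (separate from the benchmarks above) that uses time.sleep as a stand-in for any blocking call and contrasts it with the awaitable asyncio.sleep:&lt;/p&gt;

```python
# blocking_demo.py -- why a blocking call stalls the event loop
import asyncio
import time

async def blocking_task():
    # time.sleep() never yields to the event loop, so while it runs,
    # every other coroutine is frozen.
    time.sleep(0.5)

async def friendly_task():
    # asyncio.sleep() suspends this coroutine and hands control back
    # to the loop, letting other tasks make progress in the meantime.
    await asyncio.sleep(0.5)

async def main():
    t0 = time.perf_counter()
    await asyncio.gather(*(blocking_task() for _ in range(4)))
    print(f"blocking sleeps:  {time.perf_counter() - t0:.1f}s")

    t0 = time.perf_counter()
    await asyncio.gather(*(friendly_task() for _ in range(4)))
    print(f"awaitable sleeps: {time.perf_counter() - t0:.1f}s")

if __name__ == "__main__":
    asyncio.run(main())
```

&lt;p&gt;The four blocking sleeps run back to back (about two seconds in total), while the four awaitable sleeps overlap (about half a second in total).&lt;/p&gt;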

&lt;p&gt;Multiprocessing for I/O (Not Recommended)&lt;/p&gt;

&lt;p&gt;You can technically use ProcessPoolExecutor for I/O-bound work, but it's generally a poor fit. Launching separate Python processes adds significant startup overhead that outweighs any benefit when the real bottleneck is network latency. It ends up slower than the threaded version.&lt;/p&gt;

&lt;p&gt;Speeding Up CPU-Bound Tasks&lt;/p&gt;

&lt;p&gt;For this section, a recursive Fibonacci function serves as a stand-in for any heavy computation:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;def fib(n):
    return n if n &amp;lt; 2 else fib(n - 2) + fib(n - 1)
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Its running time grows exponentially with the input, making it a useful placeholder for real CPU-intensive work.&lt;/p&gt;

&lt;p&gt;Synchronous Version&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# sync_cpu.py
import time

def fib(n):
    return n if n &amp;lt; 2 else fib(n - 2) + fib(n - 1)

def main():
    t0 = time.perf_counter()
    for _ in range(20):
        fib(35)
    elapsed = time.perf_counter() - t0
    print(f"Completed in {elapsed:.2f}s")

if __name__ == "__main__":
    main()
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;Running this takes around 35 seconds. All work happens on a single CPU core.&lt;/p&gt;

&lt;p&gt;Why Threads and asyncio Won't Help Here&lt;/p&gt;

&lt;p&gt;Because of Python's Global Interpreter Lock (GIL), only one thread can execute Python code at a time. Adding threads to a CPU-bound task doesn't parallelise anything — it just adds overhead. The threaded version runs in roughly 40 seconds, slightly slower than the synchronous one.&lt;/p&gt;
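&lt;p&gt;As a sketch of what that threaded attempt looks like (with a smaller workload than the article's 20 runs of fib(35), so it finishes quickly), it is simply the downloader pattern pointed at fib:&lt;/p&gt;

```python
# threaded_cpu.py -- threads give no speedup on CPU-bound work
import time
from concurrent.futures import ThreadPoolExecutor

def fib(n):
    # Same exponential-time Fibonacci as before; the condition is
    # flipped (n > 1 instead of n below 2) but the result is identical.
    return fib(n - 1) + fib(n - 2) if n > 1 else n

def main():
    t0 = time.perf_counter()
    with ThreadPoolExecutor(max_workers=4) as pool:
        # The GIL allows only one thread to run Python bytecode at a
        # time, so these calls still execute one after another, with
        # lock contention and context-switch overhead added on top.
        list(pool.map(fib, [28] * 8))
    elapsed = time.perf_counter() - t0
    print(f"Completed in {elapsed:.2f}s")

if __name__ == "__main__":
    main()
```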

&lt;p&gt;The async version fares even worse at around 86 seconds, because the overhead of suspending and resuming coroutines at every single await compounds massively over millions of recursive calls.&lt;/p&gt;

&lt;p&gt;Multiprocessing — The Right Tool for CPU Work&lt;/p&gt;

&lt;p&gt;ProcessPoolExecutor launches a pool of worker processes (by default, one per CPU core), each with its own Python interpreter and its own GIL. The processes run independently and in true parallel:&lt;/p&gt;

&lt;pre&gt;&lt;code&gt;# parallel_cpu.py
import time
from concurrent.futures import ProcessPoolExecutor

def fib(n):
    return n if n &amp;lt; 2 else fib(n - 2) + fib(n - 1)

def main():
    t0 = time.perf_counter()
    with ProcessPoolExecutor() as pool:
        pool.map(fib, [35] * 20)
    elapsed = time.perf_counter() - t0
    print(f"Completed in {elapsed:.2f}s")

if __name__ == "__main__":
    main()
&lt;/code&gt;&lt;/pre&gt;

&lt;p&gt;On a four-core machine this finishes in about 10 seconds — less than a third of the synchronous time. Notice the code is almost identical to the threaded version; you only swapped ThreadPoolExecutor for ProcessPoolExecutor.&lt;/p&gt;

&lt;p&gt;One thing to keep in mind: since each process has its own memory space, shared objects like database connections or sessions need to be initialised per process using the &lt;code&gt;initializer&lt;/code&gt; parameter.&lt;/p&gt;
&lt;p&gt;Choosing the Right Tool&lt;/p&gt;

&lt;p&gt;Here's a simple decision framework:&lt;/p&gt;

&lt;p&gt;Step 1 — Don't add concurrency prematurely. Measure first and optimise only where it's actually slow.&lt;/p&gt;

&lt;p&gt;Step 2 — Identify whether your bottleneck is I/O or CPU.&lt;/p&gt;

&lt;p&gt;Step 3 — Pick the right tool:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;For I/O-bound work with async-capable libraries, use &lt;code&gt;asyncio&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;For I/O-bound work that depends on blocking libraries, use threads via &lt;code&gt;ThreadPoolExecutor&lt;/code&gt;.&lt;/li&gt;
&lt;li&gt;For CPU-bound work, use &lt;code&gt;ProcessPoolExecutor&lt;/code&gt;.&lt;/li&gt;
&lt;/ul&gt;
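&lt;p&gt;For the measurement in Step 1, the standard library's cProfile gives a quick read on where time actually goes. In this sketch the two slow functions are hypothetical stand-ins for your own workload:&lt;/p&gt;

```python
# profile_first.py -- find out whether you're I/O-bound or CPU-bound
import cProfile
import pstats
import time

def slow_io():
    time.sleep(0.2)  # hypothetical stand-in for a network call

def slow_cpu():
    return sum(i * i for i in range(200_000))  # hypothetical number crunching

def work():
    slow_io()
    slow_cpu()

def main():
    profiler = cProfile.Profile()
    profiler.enable()
    work()
    profiler.disable()
    # Sort by cumulative time. If sleep/socket calls dominate, the
    # job is I/O-bound; if your own functions dominate, it's CPU-bound.
    pstats.Stats(profiler).sort_stats("cumulative").print_stats(5)

if __name__ == "__main__":
    main()
```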

&lt;p&gt;Summary&lt;/p&gt;

&lt;p&gt;Concurrency is a powerful tool, but it's not free — it adds complexity and potential for subtle bugs. Applied to the right problem with the right tool, though, it can turn a 35-second script into a 10-second one, or a 14-second downloader into something that finishes in half a second. Understand your bottleneck first, then reach for the appropriate model.&lt;/p&gt;

</description>
      <category>beginners</category>
      <category>performance</category>
      <category>python</category>
      <category>tutorial</category>
    </item>
  </channel>
</rss>
