Making Python programs do multiple things at once used to feel like a puzzle. I would write code that worked perfectly but then just... waited. It waited for files to load, for websites to respond, for calculations to finish. The computer's power was sitting idle. I learned that writing faster code isn't just about better algorithms; it's about doing more at the same time. This is the world of concurrency and parallelism.
Let's start with a simple picture. Imagine a coffee shop with one barista. That's a traditional, single-threaded program. One customer orders, the barista makes the coffee, delivers it, and only then takes the next order. If someone orders a complex pour-over, everyone else waits. This is slow and inefficient. Now, imagine the barista starts the coffee machine, and while it's brewing, they take the next order and warm some milk. They are not doing two physical tasks at the same instant, but they are making progress on multiple tasks by switching between them during wait times. This is concurrency – managing multiple tasks that are in progress, overlapping in time.
Now, imagine the shop hires a second barista. Now, two customers can literally have their coffee made at the exact same time. This is parallelism – actually doing multiple things simultaneously. In computing, this usually means using more than one CPU core.
Python gives us tools for both approaches. Choosing the right one depends on what your program is waiting for. Is it waiting for external things, like disk reads or network calls? That's I/O-bound. Or is it grinding the CPU with heavy math? That's CPU-bound. The wrong tool will make your life difficult. I learned this the hard way.
The first tool is threading. Think of threads as multiple sequences of instructions within a single program that can take turns running. They share the same memory space, which is both useful and dangerous. It's useful because passing data is easy. It's dangerous because if two threads try to change the same variable at once, chaos ensues.
Threads are perfect for I/O-bound tasks. For example, if you need to download 100 web pages, a single thread would download them one by one, spending most of its time waiting for the network. With threading, you can start many downloads at once. While one thread waits for a server to respond, another can process data that already arrived.
import threading
import time
import requests

def download_page(url, results):
    try:
        response = requests.get(url, timeout=5)
        results.append((url, len(response.text)))
        print(f"Downloaded {url}")
    except Exception as e:
        results.append((url, f"Error: {e}"))

urls = [
    'https://httpbin.org/delay/1',
    'https://httpbin.org/delay/2',
    'https://httpbin.org/status/200',
    'https://httpbin.org/status/404'
]

start_time = time.time()
results = []
threads = []

for url in urls:
    thread = threading.Thread(target=download_page, args=(url, results))
    thread.start()
    threads.append(thread)

for thread in threads:
    thread.join()

end_time = time.time()
print(f"Downloaded {len(urls)} pages in {end_time - start_time:.2f} seconds")
for url, length in results:
    print(f"  {url}: {length}")
You'll notice the total time is close to the longest individual download, not the sum of all of them. That's the power of concurrency. However, Python has a quirk called the Global Interpreter Lock (GIL). In simple terms, the GIL prevents multiple threads from executing Python bytecode at the same exact moment, even on a multi-core CPU. This makes threads less effective for pure, number-crunching CPU tasks. For that, we need a different tool.
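You can see the GIL's effect for yourself with a small timing experiment. The sketch below (busy_sum is just an arbitrary CPU-bound function I made up for illustration) runs the same amount of pure-Python arithmetic sequentially and then split across four threads; on standard CPython the threaded version typically takes about as long, or slightly longer, because only one thread executes bytecode at a time.

import threading
import time

def busy_sum(n):
    """Pure-Python arithmetic loop: CPU-bound, so the GIL serializes it."""
    total = 0
    for i in range(n):
        total += i * i
    return total

N = 2_000_000

# Sequential: four calls back to back
start = time.time()
for _ in range(4):
    busy_sum(N)
print(f"Sequential: {time.time() - start:.2f}s")

# Threaded: four threads, but the GIL lets only one run Python bytecode at a time
start = time.time()
threads = [threading.Thread(target=busy_sum, args=(N,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(f"Threaded:   {time.time() - start:.2f}s  (roughly the same, or worse)")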
The second tool is multiprocessing. This is where we get true parallelism. Instead of threads, we use separate processes. Each process has its own Python interpreter, its own memory space, and its own GIL. This means they can run on different CPU cores simultaneously. The cost is that sharing data is more complicated; you can't just change a variable and have another process see it.
Use multiprocessing when you have a lot of computation to do, like resizing hundreds of images, running complex simulations, or factoring large numbers.
from multiprocessing import Pool, cpu_count
from PIL import Image
import os

def resize_image(image_info):
    """Resize a single image and save it."""
    filename, size = image_info
    try:
        img = Image.open(filename)
        img.thumbnail(size)
        new_name = f"resized_{os.path.basename(filename)}"
        img.save(new_name)
        return f"Resized {filename} -> {new_name}"
    except Exception as e:
        return f"Failed {filename}: {e}"

# List of images to process
image_files = ['photo1.jpg', 'photo2.png', 'photo3.bmp']
target_size = (800, 600)  # New dimensions
tasks = [(f, target_size) for f in image_files]

if __name__ == '__main__':  # This line is crucial for multiprocessing on Windows
    print(f"System has {cpu_count()} CPU cores")
    with Pool(processes=cpu_count()) as pool:
        # Distribute tasks across processes
        results = pool.map(resize_image, tasks)
    for result in results:
        print(result)
The if __name__ == '__main__': guard is essential on systems like Windows. It prevents the child processes from re-executing the main module and creating an infinite loop of new processes. Multiprocessing is powerful but has overhead. Creating a process is slower than creating a thread, and moving data between them takes time. It's a heavyweight tool for heavyweight tasks.
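To make the "sharing data is more complicated" point concrete, here is a minimal sketch of passing work between processes with multiprocessing.Queue instead of a shared variable. The worker function and queue names are my own for illustration, not part of any library API.

from multiprocessing import Process, Queue

def worker(task_queue, result_queue):
    """Pull numbers from one queue, push their squares onto another."""
    while True:
        n = task_queue.get()
        if n is None:          # Sentinel value: no more work
            break
        result_queue.put((n, n * n))

if __name__ == '__main__':
    task_queue = Queue()
    result_queue = Queue()

    workers = [Process(target=worker, args=(task_queue, result_queue)) for _ in range(2)]
    for p in workers:
        p.start()

    for n in range(10):
        task_queue.put(n)
    for _ in workers:          # One sentinel per worker
        task_queue.put(None)

    for _ in range(10):
        n, squared = result_queue.get()
        print(f"{n} squared is {squared}")

    for p in workers:
        p.join()

Everything crosses the process boundary as a pickled message on a queue; neither side ever reaches into the other's memory.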
The third technique is asynchronous programming with asyncio. This is different from threading and multiprocessing. Instead of relying on the operating system to switch between tasks, we write our code in a cooperative style. We explicitly say, "I'm going to wait for this network packet now, so you can run some other code while I do." It's like the barista who knows exactly when they have to wait and uses that time productively.
asyncio is fantastic for building network servers, chat applications, or web scrapers that handle thousands of connections at once with very little resource use. The key is the async and await keywords.
import asyncio
import aiohttp

async def check_website(session, url):
    """Check if a website is online and get its status."""
    try:
        async with session.get(url) as response:
            return url, response.status
    except Exception as e:
        return url, f"Connection failed: {e}"

async def main():
    urls = [
        'https://www.google.com',
        'https://www.github.com',
        'https://www.example.com',
        'https://nonexistent.website.xyz'  # This will fail
    ]
    # Create a single shared session for all requests
    async with aiohttp.ClientSession() as session:
        # Create a list of tasks (coroutines)
        tasks = []
        for url in urls:
            task = asyncio.create_task(check_website(session, url))
            tasks.append(task)
        # Wait for all tasks to complete and gather results
        results = await asyncio.gather(*tasks)
    for url, status in results:
        print(f"{url} -> {status}")

# Run the async event loop
asyncio.run(main())
When you run this, all the website checks are initiated almost at once. The program doesn't block waiting for the first one to finish before starting the second. It's a single thread managing all these overlapping network operations. The efficiency gain is enormous compared to doing them sequentially or even with a large thread pool.
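If you want to verify that claim without real network calls, here is a tiny self-contained sketch that uses asyncio.sleep as a stand-in for network latency: five one-second "requests" finish in roughly one second total rather than five, because the waits overlap.

import asyncio
import time

async def fake_request(i):
    """Stand-in for a network call that takes one second."""
    await asyncio.sleep(1)
    return i

async def main():
    start = time.time()
    results = await asyncio.gather(*(fake_request(i) for i in range(5)))
    print(f"Got {len(results)} results in {time.time() - start:.2f}s")  # ~1.00s, not ~5.00s

asyncio.run(main())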
When you have threads or processes sharing data, you need coordination. The fourth set of techniques is synchronization primitives. These are the traffic lights and stop signs of the concurrent world. Without them, you get race conditions, where the outcome depends on which thread gets to a piece of data first. It's unpredictable and a nightmare to debug.
The most basic tool is a Lock. It ensures only one thread can enter a "critical section" of code at a time.
import threading
import time

class BankAccount:
    def __init__(self, initial_balance):
        self.balance = initial_balance
        self.lock = threading.Lock()

    def unsafe_withdraw(self, amount):
        """This can lead to incorrect balance."""
        old_balance = self.balance
        # Simulate a brief pause between reading and writing
        time.sleep(0.001)
        new_balance = old_balance - amount
        self.balance = new_balance
        return new_balance

    def safe_withdraw(self, amount):
        """Using a lock protects the operation."""
        with self.lock:
            old_balance = self.balance
            time.sleep(0.001)  # Pause still here, but safe
            new_balance = old_balance - amount
            self.balance = new_balance
            return new_balance

# Test the unsafe version
account = BankAccount(1000)

def unsafe_customer():
    for _ in range(100):
        account.unsafe_withdraw(1)

threads = [threading.Thread(target=unsafe_customer) for _ in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(f"Unsafe final balance (should be 0): {account.balance}")

# Reset and test the safe version
account.balance = 1000

def safe_customer():
    for _ in range(100):
        account.safe_withdraw(1)

threads = [threading.Thread(target=safe_customer) for _ in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(f"Safe final balance (should be 0): {account.balance}")
Run this a few times. The unsafe balance will almost never be 0. Two threads read the balance (say, 100), both subtract 1, and both write 99 back. Two withdrawals occurred, but the balance only decreased by 1. The lock prevents this by making the read-modify-write sequence atomic.
Other useful tools include Semaphore (allowing a fixed number of threads into a section, like a limited resource pool), Condition (for complex waiting and notification), and Queue (a thread-safe way to pass data between producers and consumers).
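As a quick illustration of two of these, here is a minimal sketch (the function and variable names are my own) that uses a Semaphore to cap how many threads use a scarce resource at once and a Queue to collect their results without any explicit locking.

import threading
import time
import queue

# Allow at most 3 threads to use the "scarce resource" at the same time
pool_slots = threading.Semaphore(3)
results = queue.Queue()  # Thread-safe: put/get need no extra lock

def use_limited_resource(worker_id):
    with pool_slots:        # Blocks if 3 workers are already inside
        time.sleep(0.2)     # Simulate using the resource
        results.put(f"worker {worker_id} done")

threads = [threading.Thread(target=use_limited_resource, args=(i,)) for i in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()

while not results.empty():
    print(results.get())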
The fifth tool is the concurrent.futures module. This is one of my favorites because it provides a clean, high-level interface. It abstracts away many details of thread and process pools. You just say, "Here's a function and a list of inputs. Go process them with this many workers." It returns Future objects, which are placeholders for results that will arrive later.
from concurrent.futures import ThreadPoolExecutor, ProcessPoolExecutor, as_completed
import time

# A CPU-intensive task (good for processes)
def calculate_power(n, exponent):
    time.sleep(0.1)  # Simulate some work
    return n ** exponent

# An I/O-bound task (good for threads)
def fetch_data(query):
    time.sleep(0.5)  # Simulate network delay
    return f"Results for '{query}'"

def demo_threads():
    queries = ["python", "concurrency", "future", "demo", "test"]
    print("Using ThreadPoolExecutor for I/O tasks:")
    with ThreadPoolExecutor(max_workers=3) as executor:
        # Submit tasks and get Future objects
        future_to_query = {executor.submit(fetch_data, q): q for q in queries}
        # Process results as they complete
        for future in as_completed(future_to_query):
            query = future_to_query[future]
            try:
                data = future.result()
                print(f"  {query}: {data}")
            except Exception as e:
                print(f"  {query} generated an error: {e}")

def demo_processes():
    numbers = [(2, 31), (5, 20), (10, 10), (3, 15)]
    print("\nUsing ProcessPoolExecutor for CPU tasks:")
    with ProcessPoolExecutor(max_workers=2) as executor:
        # A lambda can't be pickled and sent to a worker process,
        # so pass the module-level function and unpack the pairs into two iterables.
        bases, exponents = zip(*numbers)
        results = executor.map(calculate_power, bases, exponents)
        for (base, exp), result in zip(numbers, results):
            print(f"  {base}^{exp} = {result}")

if __name__ == '__main__':
    demo_threads()
    demo_processes()
The map function here is very similar to the built-in map, but it runs in parallel. The as_completed iterator gives you results in the order they finish, not the order they were submitted, which is great for responsiveness.
The sixth pattern is the Actor model. This is a different way of thinking. Instead of sharing memory and using locks, each "actor" is an independent entity with its own private state. Actors communicate by sending immutable messages to each other's mailboxes. This eliminates shared state problems by design. While Python doesn't have a built-in actor system, we can build a simple version with threads and queues.
import threading
import queue
import time

class PrinterActor(threading.Thread):
    def __init__(self):
        super().__init__()
        self.mailbox = queue.Queue()
        self.shutdown_flag = False

    def send(self, message):
        self.mailbox.put(message)

    def run(self):
        while not self.shutdown_flag:
            try:
                # Wait for a message for up to 1 second
                message = self.mailbox.get(timeout=1)
                self.handle_message(message)
                self.mailbox.task_done()
            except queue.Empty:
                # Timeout, check shutdown flag again
                continue

    def handle_message(self, message):
        # This is where the actor's behavior is defined
        if message['command'] == 'print':
            print(f"Actor says: {message['text']}")
            time.sleep(0.5)  # Simulate slow printing
        elif message['command'] == 'shutdown':
            self.shutdown_flag = True

    def stop(self):
        self.mailbox.join()  # Let the actor drain its mailbox before stopping
        self.shutdown_flag = True
        self.join()

# Using the actor
actor = PrinterActor()
actor.start()

# Send messages asynchronously
actor.send({'command': 'print', 'text': 'Hello from the main thread!'})
actor.send({'command': 'print', 'text': 'This is the second message.'})
actor.send({'command': 'print', 'text': 'Processing...'})

# Ask the actor to shut down once it has worked through its mailbox
actor.send({'command': 'shutdown'})
actor.stop()
print("Main thread: Actor has been shut down.")
The main thread never directly tells the actor how to print. It just sends a message. The actor decides what to do with it, in its own time, in its own thread. This separation of concerns makes complex systems easier to reason about.
The seventh technique involves asynchronous generators and pipelines. With asyncio, you can create data processing pipelines that handle streams efficiently, controlling the flow so that a fast producer doesn't overwhelm a slow consumer. This is called backpressure.
import asyncio

async def number_generator(count, delay):
    """Produces numbers."""
    for i in range(count):
        await asyncio.sleep(delay)  # Production delay
        yield i
        print(f"  Generated: {i}")

async def slow_square(number_stream):
    """Squares numbers, but slowly."""
    async for num in number_stream:
        await asyncio.sleep(0.3)  # Processing is slower than generation
        squared = num * num
        yield squared
        print(f"  Squared: {num} -> {squared}")

async def main():
    print("Starting async pipeline: Generate -> Square -> Print")
    # Create the pipeline
    gen = number_generator(5, 0.1)
    processor = slow_square(gen)
    # Consume the final results
    async for result in processor:
        print(f"Result: {result}")
    print("Pipeline finished.")

asyncio.run(main())
You'll see that the generator has to wait for the squaring step to finish before it can produce the next number. This automatic coordination prevents a memory blow-up from an unbounded queue and balances the workload.
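When the stages aren't chained generators, you can get the same effect explicitly with a bounded asyncio.Queue. The sketch below (the producer/consumer names are mine) caps the queue at two items, so put() suspends the fast producer until the slow consumer catches up.

import asyncio

async def producer(q):
    for i in range(6):
        await q.put(i)            # Suspends when the queue already holds 2 items
        print(f"Produced {i}")

async def consumer(q):
    while True:
        item = await q.get()
        await asyncio.sleep(0.5)  # Slow consumer
        print(f"Consumed {item}")
        q.task_done()

async def main():
    q = asyncio.Queue(maxsize=2)  # The bound is what creates backpressure
    consumer_task = asyncio.create_task(consumer(q))
    await producer(q)
    await q.join()                # Wait until every item has been processed
    consumer_task.cancel()

asyncio.run(main())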
The eighth and final technique for large-scale applications is using distributed task queues like Celery. When your workload outgrows a single machine, you need to distribute it across a cluster. Celery uses a message broker (like Redis or RabbitMQ) to distribute tasks to worker processes, which can be on different machines.
You define tasks as plain Python functions. A client (like your web application) sends a task message to the broker. A worker, which might be on another computer entirely, picks it up, executes it, and can optionally store the result. This is how many web applications handle sending emails, processing videos, or generating reports without making the user wait.
# File: tasks.py (This is the worker's code)
# pip install celery redis
from celery import Celery
import time

# Create a Celery application, specifying the message broker and result backend
app = Celery('tasks',
             broker='redis://localhost:6379/0',    # Redis server
             backend='redis://localhost:6379/0')

@app.task
def process_video(video_path):
    """Simulate a long video processing task."""
    print(f"Starting to process {video_path}")
    time.sleep(10)  # Simulate 10 seconds of work
    output_path = f"processed_{video_path}"
    print(f"Finished processing {video_path}")
    return {"input": video_path, "output": output_path, "status": "success"}

# To run a worker, you would execute in terminal:
# celery -A tasks worker --loglevel=info

# File: client.py (This is the code that submits tasks)
from tasks import process_video

# This doesn't run the function; it sends a message.
task_result = process_video.delay("home_video.mp4")
print(f"Task ID: {task_result.id}")
print("Task submitted. The web app can continue responding to users.")

# Later, you could check the result (blocking call for demo):
# if task_result.ready():
#     print(f"Result: {task_result.get(timeout=1)}")
This architecture is incredibly robust. Workers can crash and be restarted, tasks can be retried, and you can scale horizontally by just adding more worker machines.
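Retries, for example, are mostly a matter of decorator options. The snippet below is a rough sketch of how a task in the same tasks.py might declare them; the bind=True / self.retry pattern is standard Celery, but the specific numbers, the send_report_email task, and the deliver_email helper are purely illustrative.

# File: tasks.py (continued) -- illustrative retry configuration
@app.task(bind=True, max_retries=3, default_retry_delay=30)
def send_report_email(self, recipient):
    """Hypothetical task: retry up to 3 times, 30 seconds apart, if sending fails."""
    try:
        deliver_email(recipient)   # Assumed helper standing in for real email-sending code
    except ConnectionError as exc:
        # Re-queue this task instead of failing outright
        raise self.retry(exc=exc)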
Choosing the right tool comes down to your problem. Here's a mental checklist I use:
- Is it mostly waiting for I/O (network, disk)? Use asyncio or ThreadPoolExecutor.
- Is it heavy computation that uses the CPU? Use ProcessPoolExecutor or multiprocessing.
- Do I need fine-grained control over many concurrent I/O operations? Use asyncio.
- Is it a simple script to speed up a list of tasks? Use concurrent.futures.
- Are threads stepping on each other's data? Use synchronization primitives (Lock, Queue).
- Is the system becoming complex with shared state? Consider the Actor model for clarity.
- Does the workload need to scale beyond one machine? Look at distributed task queues like Celery.
Start simple. Often, a ThreadPoolExecutor or ProcessPoolExecutor is all you need. Introduce complexity only when you measure a benefit. Concurrency adds overhead—communication between threads or processes isn't free. Always profile your code to see if the parallel version is actually faster.
The goal isn't to use the most advanced technique. The goal is to write programs that are efficient, responsive, and correct. By understanding these eight techniques, you have a toolbox to tackle a wide range of performance bottlenecks. You can stop writing code that just waits and start writing code that gets things done.