Living with the GIL: Strategies for Concurrent Python

Aaron Rose

Timothy and Margaret walked through the library's quiet reading room toward the small coffee shop in the corner. The afternoon sun streamed through the tall windows, and only a few patrons remained, absorbed in their books.

"So if the GIL prevents my Python code from running in parallel," Timothy said, pulling out his laptop as they sat down, "how do people actually do parallel computation in Python? I mean, machine learning, data science—those need real parallelism, right?"

Margaret smiled. "They do. And Python has three main strategies for concurrency and parallelism. We've already seen threading work for I/O. Now let me show you the others."

She opened her laptop next to his. "Let's start with the most obvious solution: if one Python interpreter can only run one thread at a time, what if we use multiple Python interpreters?"

Multiple Interpreters: The Multiprocessing Solution

"Wait," Timothy said. "Multiple interpreters? Like running Python multiple times?"

"Exactly." Margaret typed:

from multiprocessing import Process
import time

def cpu_intensive_task(name):
    """A CPU-bound task that takes time"""
    count = 0
    for i in range(50_000_000):
        count += i
    print(f"{name} finished: {count}")

# The __main__ guard is required on Windows and macOS, where new
# processes re-import this module instead of forking it
if __name__ == "__main__":
    # Sequential execution
    start = time.time()
    cpu_intensive_task("Task 1")
    cpu_intensive_task("Task 2")
    sequential_time = time.time() - start
    print(f"Sequential: {sequential_time:.2f}s")

    # Parallel execution with multiprocessing
    start = time.time()
    p1 = Process(target=cpu_intensive_task, args=("Task 1",))
    p2 = Process(target=cpu_intensive_task, args=("Task 2",))
    p1.start()
    p2.start()
    p1.join()
    p2.join()
    parallel_time = time.time() - start
    print(f"Parallel: {parallel_time:.2f}s")
    print(f"Speedup: {sequential_time / parallel_time:.2f}x")

"Try running this," Margaret said.

Timothy ran the code and watched the output:

Task 1 finished: 1249999975000000
Task 2 finished: 1249999975000000
Sequential: 4.32s
Task 1 finished: 1249999975000000
Task 2 finished: 1249999975000000
Parallel: 2.18s
Speedup: 1.98x

His eyes widened. "Nearly 2x speedup! It actually worked!"

"Because each Process is a completely separate Python interpreter," Margaret explained. "Separate memory space, separate GIL. They don't share the lock because they're literally different programs running at the same time."

She drew a quick diagram:

Threading (one interpreter, one GIL):
┌─────────────────────────────────┐
│   Python Interpreter            │
│   ┌───────────────────────┐     │
│   │  GIL (one lock)       │     │
│   │  - Thread 1           │     │
│   │  - Thread 2           │     │
│   │  - Thread 3           │     │
│   └───────────────────────┘     │
└─────────────────────────────────┘

Multiprocessing (multiple interpreters, multiple GILs):
┌──────────────────┐  ┌──────────────────┐  ┌──────────────────┐
│  Process 1       │  │  Process 2       │  │  Process 3       │
│  ┌────────────┐  │  │  ┌────────────┐  │  │  ┌────────────┐  │
│  │ Own GIL    │  │  │  │ Own GIL    │  │  │  │ Own GIL    │  │
│  └────────────┘  │  │  └────────────┘  │  │  └────────────┘  │
└──────────────────┘  └──────────────────┘  └──────────────────┘
    True parallel execution across cores

The Process Pool Pattern

Timothy typed something on his laptop. "So if I have four CPU cores, I could run four processes?"

"You could," Margaret said, "but there's a better pattern. Let me show you the Process Pool."

She typed a new example:

from multiprocessing import Pool, Process, Queue
import time

def process_chunk(chunk_id):
    """Process a chunk of data"""
    total = 0
    for i in range(10_000_000):
        total += i * chunk_id
    return total

def worker(chunk_id, queue):
    result = process_chunk(chunk_id)
    queue.put(result)

if __name__ == "__main__":
    # Without pool - manual process management
    start = time.time()
    queue = Queue()
    processes = []
    for i in range(4):  # Create 4 processes
        p = Process(target=worker, args=(i, queue))
        p.start()
        processes.append(p)

    # Drain results before joining: a full queue can deadlock join()
    results = [queue.get() for _ in range(4)]
    for p in processes:
        p.join()

    manual_time = time.time() - start
    print(f"Manual (4 tasks, 4 processes): {manual_time:.2f}s")

    # With Pool - automatic management
    start = time.time()
    with Pool(processes=4) as pool:
        results = pool.map(process_chunk, range(4))
    pool_time = time.time() - start
    print(f"Pool (4 tasks, 4 processes): {pool_time:.2f}s")

Timothy ran it:

Manual (4 tasks, 4 processes): 0.95s
Pool (4 tasks, 4 processes): 0.93s

"About the same speed," Timothy observed, "but the Pool code is way cleaner."

"Exactly," Margaret said. "The real power of Pool isn't raw speed—it's simplicity and worker reuse. Watch what happens when we have more tasks than workers."

She modified the code:

# 8 tasks, but only 4 worker processes
with Pool(processes=4) as pool:
    results = pool.map(process_chunk, range(8))

"The Pool automatically distributes those 8 tasks across 4 worker processes, reusing each worker for multiple tasks. You don't create and destroy processes for each task. Much more efficient and way easier to code."

The Cost of Multiprocessing

Timothy looked thoughtful. "This seems perfect. Why not just use multiprocessing for everything?"

"Great question. Try this experiment," Margaret said. She typed:

from multiprocessing import Pool
import time

def tiny_task(x):
    """A very small task"""
    return x * 2

if __name__ == "__main__":
    # Process a lot of tiny tasks
    data = list(range(10000))

    # Sequential
    start = time.time()
    results = [tiny_task(x) for x in data]
    sequential_time = time.time() - start

    # Multiprocessing
    start = time.time()
    with Pool(processes=4) as pool:
        results = pool.map(tiny_task, data)
    parallel_time = time.time() - start

    print(f"Sequential: {sequential_time:.4f}s")
    print(f"Parallel: {parallel_time:.4f}s")
    print(f"Speedup: {sequential_time / parallel_time:.2f}x")

Timothy ran it and frowned at the output:

Sequential: 0.0008s
Parallel: 0.1234s
Speedup: 0.01x

"The parallel version is 100x slower!"

"Because creating processes and moving data between them is expensive," Margaret explained. "Each process needs its own Python interpreter, its own memory space. When you call pool.map(), Python has to:"

She counted on her fingers:

"One, serialize your data using pickle. Two, send it to the worker process. Three, deserialize it there. Four, run the function. Five, serialize the result. Six, send it back. Seven, deserialize it in the main process."

"That's a lot of overhead," Timothy said.

"For tiny tasks, that overhead dominates. Multiprocessing only makes sense when your tasks are large enough that the parallel speedup outweighs the communication cost."

She added, "Oh, and not everything can be pickled. Lambda functions, local classes, open file handles—those can't be passed to worker processes. You need to use regular functions and simple data types."

A Different Kind of Concurrency: Async/Await

Margaret ordered two coffees from the barista and turned back to Timothy. "Now let me show you something completely different. Remember how threading helped with I/O because the GIL gets released?"

"Right," Timothy said. "Network requests, file operations."

"There's another way to handle I/O concurrency that doesn't use threads at all. It's called async/await, and it's based on cooperative multitasking."

She typed a new example:

import asyncio
import aiohttp
import time

# First, the regular blocking version with requests
import requests

def fetch_blocking(url):
    response = requests.get(url)
    return len(response.content)

urls = [
    "https://python.org",
    "https://github.com",
    "https://stackoverflow.com",
    "https://pypi.org",
    "https://docs.python.org",
]

start = time.time()
for url in urls:
    fetch_blocking(url)
blocking_time = time.time() - start
print(f"Blocking: {blocking_time:.2f}s")

# Now with async/await
# Note: We use aiohttp instead of requests because requests blocks
# the event loop. Async requires libraries built for async.
async def fetch_async(session, url):
    async with session.get(url) as response:
        content = await response.read()
        return len(content)

async def fetch_all():
    async with aiohttp.ClientSession() as session:
        tasks = [fetch_async(session, url) for url in urls]
        results = await asyncio.gather(*tasks)
        return results

start = time.time()
results = asyncio.run(fetch_all())
async_time = time.time() - start
print(f"Async: {async_time:.2f}s")
print(f"Speedup: {blocking_time / async_time:.2f}x")

Timothy ran the code:

Blocking: 2.45s
Async: 0.51s
Speedup: 4.80x

"5x faster!" Timothy exclaimed. "But how? Is this using threads?"

"No threads at all," Margaret said. "Just one thread, actually. Let me show you what's happening."

She drew another diagram:

Traditional blocking I/O:
Task 1: [Request]──wait──[Response] [Process]
Task 2:                              [Request]──wait──[Response] [Process]
Task 3:                                                           [Request]──wait──[Response]
        └──────────────────────────────────────────────────────────────────────────────────┘
                                    Total time: sum of all waits

Async/await:
Task 1: [Request]──wait──[Response] [Process]
Task 2: [Request]──wait──────────────[Response] [Process]
Task 3: [Request]──wait──[Response]─────────────[Process]
        └───────────────────────────────┘
            Total time: longest wait

"While one task is waiting for I/O," Margaret explained, "the event loop switches to another task. They all run in the same thread, taking turns, but they don't wait for each other. That's cooperative multitasking."

How Async Really Works

Timothy looked puzzled. "But how does it know when to switch tasks?"

"The await keyword is the signal," Margaret said. "When you write await response.read(), you're telling Python: 'I'm about to wait for something. Feel free to run other tasks while I wait.'"

She typed a clearer example:

import asyncio
import time

async def task(name, delay):
    print(f"{name}: Starting")
    await asyncio.sleep(delay)  # Cooperative yield point
    print(f"{name}: Finished after {delay}s")
    return f"{name} result"

async def main():
    print("Starting three tasks...")
    results = await asyncio.gather(
        task("Task 1", 2),
        task("Task 2", 1),
        task("Task 3", 3)
    )
    return results

start = time.time()
results = asyncio.run(main())
print(f"Total time: {time.time() - start:.2f}s")
print(f"Results: {results}")

Timothy ran it:

Starting three tasks...
Task 1: Starting
Task 2: Starting
Task 3: Starting
Task 2: Finished after 1s
Task 1: Finished after 2s
Task 3: Finished after 3s
Total time: 3.00s
Results: ['Task 1 result', 'Task 2 result', 'Task 3 result']

"They all started at once," Timothy observed, "and the total time was 3 seconds, not 6."

"Because they ran concurrently. Task 2 finished first even though it started at the same time as Task 1. The event loop was juggling all three, and they each waited independently."

"But this is still one thread, right?" Timothy asked.

"One thread, one process, one GIL," Margaret confirmed. "No parallelism at all. Just very efficient concurrency."

She leaned forward to emphasize her next point. "This is critical: async/await only helps with I/O-bound work. It won't speed up CPU-bound tasks at all because everything still runs in one thread. For CPU parallelism, you still need multiprocessing."
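A minimal sketch makes the limitation visible: a coroutine that computes without ever awaiting never yields control, so "concurrent" CPU tasks still run back to back.

import asyncio
import time

async def crunch(name):
    # No await inside the loop: this coroutine never yields to the event loop
    total = 0
    for i in range(20_000_000):
        total += i
    print(f"{name} done")

async def main():
    start = time.time()
    await asyncio.gather(crunch("A"), crunch("B"))
    # Elapsed is roughly A plus B, not max(A, B): nothing overlapped
    print(f"Elapsed: {time.time() - start:.2f}s")

asyncio.run(main())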

Choosing the Right Tool

Timothy took a sip of his coffee. "So we have threading, multiprocessing, and async. When do I use which?"

Margaret pulled out a piece of paper and drew a decision tree:

# Decision Tree for Python Concurrency

# Question 1: Is your task CPU-bound or I/O-bound?

# CPU-BOUND (computation):
# - Image processing
# - Data analysis
# - Mathematical calculations
# - Algorithm implementations
#
# Use: multiprocessing
# Why: Only way to achieve true parallelism for Python code

# I/O-BOUND (waiting):
# - Network requests
# - File operations
# - Database queries
#
# Next question: How many I/O operations?

# MANY I/O operations (hundreds or thousands):
# Use: async/await
# Why: Lightweight, efficient, scales to thousands of concurrent operations
# Example: Web scraping 1000 URLs, handling many websocket connections
#
# FEW I/O operations (dozens or less):
# Use: threading
# Why: Simpler to understand, easier to add to existing code
# Example: Fetching data from 5 APIs, processing a handful of files
#
# Note: The threshold depends on your use case, but async's
# advantage grows with scale.

"Let me show you why this matters," Margaret said. She typed:

import asyncio
import threading
import time
import aiohttp
import requests

urls = ["https://httpbin.org/delay/1" for _ in range(50)]

# Threading approach
def fetch_threading():
    def fetch(url):
        requests.get(url)

    start = time.time()
    threads = []
    for url in urls:
        t = threading.Thread(target=fetch, args=(url,))
        t.start()
        threads.append(t)

    for t in threads:
        t.join()

    return time.time() - start

# Async approach
async def fetch_async():
    async def fetch(session, url):
        async with session.get(url) as response:
            await response.read()

    start = time.time()
    async with aiohttp.ClientSession() as session:
        tasks = [fetch(session, url) for url in urls]
        await asyncio.gather(*tasks)

    return time.time() - start

threading_time = fetch_threading()
async_time = asyncio.run(fetch_async())

print(f"Threading (50 requests): {threading_time:.2f}s")
print(f"Async (50 requests): {async_time:.2f}s")

Timothy ran it:

Threading (50 requests): 1.87s
Async (50 requests): 1.23s

"Async is faster," Timothy noted.

"And more importantly, more efficient. Those 50 threads consume a lot more memory than the async event loop. Try it with 1000 requests and threading becomes painful. Async scales much better."

The Real-World Pattern

They finished their coffee. Margaret showed Timothy one more example.

"In real applications, you often combine these strategies," she said. She typed:

from concurrent.futures import ProcessPoolExecutor
import asyncio
import aiohttp
import time

# CPU-intensive processing function
def process_data(data):
    """Simulate CPU-intensive work"""
    result = 0
    for i in range(1_000_000):
        result += i * data
    return result

# Async I/O + multiprocessing combined
async def fetch_and_process(session, url, executor):
    """Fetch data (async I/O), then process it (multiprocessing)"""
    # I/O-bound part: fetch data asynchronously
    async with session.get(url) as response:
        data = await response.read()

    # CPU-bound part: process in a separate process
    # run_in_executor schedules the function in the process pool
    loop = asyncio.get_running_loop()
    result = await loop.run_in_executor(executor, process_data, len(data))
    return result

async def main():
    urls = [f"https://httpbin.org/bytes/{size}"
            for size in [1024, 2048, 4096, 8192]]

    # Create a process pool for CPU work
    with ProcessPoolExecutor(max_workers=4) as executor:
        async with aiohttp.ClientSession() as session:
            # Each task does async I/O + CPU processing
            tasks = [fetch_and_process(session, url, executor)
                     for url in urls]
            results = await asyncio.gather(*tasks)

    return results

if __name__ == "__main__":  # guard required: worker processes re-import this module
    start = time.time()
    results = asyncio.run(main())
    print(f"Combined approach: {time.time() - start:.2f}s")
    print(f"Processed {len(results)} items")

"See how it works?" Margaret pointed at the screen. "The async event loop handles all the network requests concurrently. When data arrives, run_in_executor sends the CPU-intensive processing to a separate process. The event loop doesn't block waiting for the CPU work—it continues handling other I/O."

"Async for the I/O coordination, multiprocessing for the CPU work," Timothy said.

"Exactly. You get the best of both worlds: efficient I/O concurrency and true CPU parallelism."

The Future: Life Without the GIL

Timothy had been thinking about something. "You mentioned at the start that the GIL might go away?"

"Maybe," Margaret said. "There's a project called PEP 703 - making the GIL optional in Python 3.13 and beyond."

She pulled up a webpage on her phone. "It's called 'nogil Python'. The idea is to make CPython work without the GIL, using more sophisticated locking mechanisms. It's experimental, but it's making progress."

"So threading will just work for CPU-bound tasks?" Timothy asked.

"In nogil Python, yes. But there are trade-offs. Single-threaded code might be slower without the GIL's simplicity. C extensions would need updates. It's a massive change."

"Is it happening?"

"It's being actively developed. Whether it becomes the default depends on whether the benefits outweigh the costs. But even if it does, multiprocessing and async/await will still be important. They solve different problems."

Understanding When to Use What

Timothy opened a new file and started taking notes. "Let me see if I've got this."

He typed:

# My Concurrency Cheat Sheet

# CPU-Bound Tasks (Pure Python computation):
# - Use: multiprocessing
# - Why: Only way to get true parallelism
# - Watch out for: Overhead with small tasks
from multiprocessing import Pool

def cpu_work(item):
    # Heavy computation
    pass

if __name__ == "__main__":  # guard needed: workers re-import this module
    with Pool() as pool:
        results = pool.map(cpu_work, data)

# I/O-Bound Tasks (Network, files, database):
# - Few operations (dozens): Use threading
# - Many operations (hundreds or more): Use async/await
# - Why: Threading releases the GIL during I/O; async never blocks on it
# - Watch out for: Threading is simpler but less scalable

# Threading for simple I/O
import threading

def io_work(url):
    # Fetch data
    pass

threads = [threading.Thread(target=io_work, args=(url,))
           for url in urls]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Async for many I/O operations
import asyncio

async def io_work_async(url):
    # Fetch data
    pass

async def main():
    tasks = [io_work_async(url) for url in urls]
    await asyncio.gather(*tasks)

asyncio.run(main())

# Mixed Workload:
# - Use async for I/O coordination
# - Use multiprocessing for CPU work
# - Combine them with loop.run_in_executor()

"Perfect summary," Margaret said. "You've got it."

The Takeaway

Timothy closed his laptop, finally understanding Python's concurrency landscape.

Multiprocessing creates separate Python interpreters: Each with its own GIL, enabling true parallelism.

Process Pools manage workers efficiently: Reuse processes across tasks instead of creating new ones.

Multiprocessing has overhead: Serializing and deserializing data between processes takes time.

Only use multiprocessing for large tasks: Overhead dominates for small, quick operations.

Async/await is cooperative concurrency: One thread switching between tasks at await points.

Async doesn't use threads: It's single-threaded, event-loop based concurrency.

Async releases control at await points: Other tasks run while one task waits for I/O.

Threading is simpler for small I/O: Easier to understand and integrate into existing code.

Async scales better for many I/O operations: Can handle thousands of concurrent operations efficiently.

The event loop juggles tasks: Switches between them when they hit await points.

CPU-bound work needs multiprocessing: It's the only way to achieve true parallelism in Python.

I/O-bound work can use threading or async: Threading releases the GIL while it waits; async sidesteps the GIL entirely by staying on one thread.

Async is more memory efficient: No per-thread overhead, scales to many concurrent operations.

You can combine strategies: Async for I/O coordination, multiprocessing for CPU work.

PEP 703 makes the GIL optional: An experimental free-threaded ("nogil") build ships with Python 3.13 and later.

Nogil has trade-offs: May slow single-threaded code, requires C extension updates.

The GIL will stay relevant: Even in a nogil future, understanding concurrency patterns matters.

Choose based on workload type: CPU-bound → multiprocessing, many I/O → async, few I/O → threading.

Understanding Concurrency

Timothy had discovered Python's strategies for concurrent and parallel execution.

Multiprocessing revealed that separate interpreters mean separate GILs, that creating processes has overhead but enables true CPU parallelism, and that Process Pools efficiently distribute work across a fixed number of worker processes.

He learned that async/await achieves concurrency without parallelism or threads, that cooperative multitasking switches between tasks at await points, and that the event loop can handle thousands of concurrent I/O operations in a single thread.

Moreover, Timothy understood that the choice between threading and async depends on scale, that threading is simpler for a few I/O operations while async scales better for many, and that neither threading nor async helps with CPU-bound Python code because they still share the GIL.

He learned that real applications often combine strategies, using async for I/O coordination and multiprocessing for CPU work, and that understanding the cost of serialization and inter-process communication is crucial for effective multiprocessing.

He understood that the GIL might become optional in future Python versions through PEP 703, but that this change involves trade-offs, and that even in a nogil future, understanding these concurrency patterns remains essential for writing efficient Python.

Most importantly, Timothy understood that choosing the right concurrency model starts with identifying whether work is CPU-bound or I/O-bound, that each strategy has specific strengths and costs, and that mastering Python's concurrency tools means knowing not just how they work but when to use each one.

The library was closing soon. As they packed up their laptops, Timothy felt he'd finally demystified one of Python's most misunderstood features. The GIL wasn't a limitation - it was a design choice that made sense once you understood the alternatives.


Previous in this series: The GIL Revealed: Why Python Threading Isn't Really Parallel


Aaron Rose is a software engineer and technology writer at tech-reader.blog and the author of Think Like a Genius.
