The Origin of the GIL
Let's start with a question: are Python's built-in objects like list and dict thread-safe?
At the Python level, yes — list, dict, and other built-in objects are thread-safe. That's common knowledge. But when we study their source code, we find no mutex locks anywhere. That's a bit surprising.
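You can check this empirically. In the sketch below (a quick sanity check of my own, not from CPython's test suite), ten threads hammer a shared list concurrently and no appends are ever lost:

import threading

shared = []

def worker():
    # Each append is a single atomic operation under the GIL,
    # so no elements are lost even without explicit locking.
    for i in range(10_000):
        shared.append(i)

threads = [threading.Thread(target=worker) for _ in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()

assert len(shared) == 100_000  # always holds on CPython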
Take the append method of list as an example:
static int
app1(PyListObject *self, PyObject *v)
{
    Py_ssize_t n = PyList_GET_SIZE(self);

    assert(v != NULL);
    if (n == PY_SSIZE_T_MAX) {
        PyErr_SetString(PyExc_OverflowError,
            "cannot add more objects to list");
        return -1;
    }

    if (list_resize(self, n+1) < 0)  // Step 1: grow the list by one slot
        return -1;

    Py_INCREF(v);
    PyList_SET_ITEM(self, n, v);     // Step 2: store the new element
    return 0;
}
At the C level, append is not an atomic operation. Without any locking, imagine thread A calls append: it expands the list length and is then context-switched out. Thread B wakes up and reads l[-1], the last element. The length has already grown but the new slot has not been filled yet, so thread B gets an invalid object!
Thread A: list_resize(n+1) ← expands length
↓ context switch!
Thread B: access l[-1] ← gets uninitialized memory ❌
↓ context switch back
Thread A: PyList_SET_ITEM ← too late
This kind of race condition caused by concurrent access is normally solved with a mutex lock — one lock per object:
static int
some_op(PyListObject *self, ...)
{
    lock(self);      // acquire this object's own lock
    // ... operate on self ...
    unlock(self);    // release it
    return 0;
}
But we don't see this in CPython's source. So what's going on?
Enter the GIL
Python uses a different approach: the Global Interpreter Lock (GIL) — a single global lock for the entire interpreter. Any Python thread that wants to execute bytecode must first acquire the GIL. At any given moment, only one thread runs inside the VM.
How does the VM alternate between threads? Python borrows the time-slice idea from OS process scheduling. In CPython 2, the slice was measured in bytecodes: a running thread counted the instructions it executed and, after a configurable check interval (100 bytecodes by default, via sys.setcheckinterval), voluntarily released the GIL so waiting threads could wake up. Since CPython 3.2, the slice is measured in wall-clock time instead: a running thread gives up the GIL after a configurable switch interval (5 ms by default, via sys.setswitchinterval).
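On a modern interpreter you can inspect and tune this slice directly:

import sys

# Python 3.2+ exposes the time-based slice:
print(sys.getswitchinterval())   # 0.005 → 5 ms by default

# Coarser slices mean fewer GIL handoffs (and less responsiveness):
sys.setswitchinterval(0.05)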
Thread lifecycle under the GIL:
┌──────────────────────────────────────────────────────────┐
│ Python VM (single slot) │
│ │
│ [Thread A: Running] ──bytecode limit──▶ releases GIL │
│ ↓ ↓ │
│ [Thread A: Ready] [Thread B: acquires GIL]│
│ ↓ │
│ [Thread B: Running] │
└──────────────────────────────────────────────────────────┘
Thread states:
┌──────────────┬────────────────────────────────────────────┐
│ Running │ Currently executing bytecode (holds GIL) │
│ Ready │ Waiting to acquire the GIL │
│ IO Blocked │ Blocked on a system call, GIL released │
└──────────────┴────────────────────────────────────────────┘
- Running → Ready: bytecode limit reached, releases GIL voluntarily
- Running → IO Blocked: about to make a blocking syscall, releases GIL first
- IO Blocked → Ready: syscall returns, waits to re-acquire GIL
- Ready → Running: acquires GIL, starts executing
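The IO Blocked transitions are easy to observe in practice: a blocking call releases the GIL before entering the kernel, so blocked threads overlap. A minimal sketch, using time.sleep as a stand-in for a blocking syscall:

import time
from threading import Thread

def blocked():
    time.sleep(1)  # releases the GIL while sleeping

threads = [Thread(target=blocked) for _ in range(4)]
start = time.perf_counter()
for t in threads:
    t.start()
for t in threads:
    t.join()

# ~1 second, not 4: all four threads sleep concurrently.
print(f"{time.perf_counter() - start:.2f}s")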
Why GIL Instead of Per-Object Locks?
| Approach | Advantage | Disadvantage |
|---|---|---|
| Per-object locks | Threads can run in parallel | Huge lock/unlock overhead |
| GIL | No frequent lock/unlock needed | Only one thread runs at a time |
This trade-off has actually been measured: Greg Stein's free-threading patch for Python 1.4 replaced the GIL with fine-grained locks, and the locking overhead (especially on hot objects like dict) famously made single-threaded code roughly twice as slow. Since single-threaded programs are the common case, the GIL is clearly the better deal, and the Python authors' choice makes sense.
Impact of the GIL
Under the GIL, only one thread runs in the VM at any moment. Does this mean multithreading is completely useless for performance? It depends on the program's characteristics.
Programs generally operate in two states:
- Runnable: either Running or Ready — competing for CPU resources
- Blocked: waiting for I/O — yielding CPU resources
Based on the ratio of time spent in each state, programs fall into two categories:
- CPU-bound: most time spent in Running state
- I/O-bound: most time spent in IO Blocked state
I/O-Bound Programs
Single thread timeline:
[=run=][~~~IO wait~~~][=run=][~~~IO wait~~~][=run=]
Multi-thread timeline (GIL released during IO):
Thread A: [=run=][~~~IO~~~~~~~~~~~][=run=]
Thread B: [=run=][~~~IO~~~][=run=][=run=]
↑ Thread B runs while A waits for IO
I/O-bound programs are relatively unaffected by the GIL. When a thread is waiting for I/O, it releases the GIL, allowing other threads to run.
A classic example: a multi-threaded web crawler. Most time is spent waiting for server responses. A multi-threaded crawler can dramatically reduce total runtime.
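A minimal sketch of that pattern, with placeholder URLs standing in for real pages to crawl:

from concurrent.futures import ThreadPoolExecutor
from urllib.request import urlopen

urls = [
    "https://example.com/page1",   # placeholder URLs
    "https://example.com/page2",
    "https://example.com/page3",
]

def fetch(url):
    # urlopen blocks on the network; the GIL is released while waiting,
    # so other threads can issue their requests in the meantime.
    with urlopen(url) as resp:
        return url, len(resp.read())

with ThreadPoolExecutor(max_workers=8) as pool:
    for url, size in pool.map(fetch, urls):
        print(url, size)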
CPU-Bound Programs
Single thread timeline:
[============================== compute ==============================]
Multi-thread timeline (GIL forces alternation):
Thread A: [=run=][wait][=run=][wait][=run=][wait]
Thread B: [wait][=run=][wait][=run=][wait][=run=]
Combined: same total time, no speedup ❌
CPU-bound programs are heavily impacted by the GIL. Python threads cannot execute in parallel across multiple cores — multithreading provides no speedup for CPU-bound work.
How to Achieve True Multi-Core Execution
Let's use Monte Carlo π estimation — a classic CPU-bound problem — to explore Python's multi-core options.
The Algorithm
Draw a unit square with an inscribed quarter-circle (radius = 1):
(0,1)
│ ╭──────
│ ╱ sector
│ ╱
│╱
└──────── (1,0)
Square area = 1
Sector area = π/4
P(point lands in sector) = π/4
→ π ≈ 4 × (points in sector) / (total points)
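For example, if 785,398 out of 1,000,000 random points land inside the sector, the estimate is 4 × 785398 / 1000000 = 3.141592 (illustrative counts).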
The more trials, the more accurate the estimate — and the more computation required.
Implementation
import random

def sample(n):
    in_sectors = 0
    for _ in range(n):
        x, y = random.random(), random.random()
        if x*x + y*y < 1:
            in_sectors += 1
    return n, in_sectors

def eval_pi(n, in_sectors):
    return 4. * in_sectors / n
Single-threaded: baseline
def estimate_pi(n):
    return eval_pi(*sample(n))

# 100 million trials: 4.59 seconds
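The timings quoted in this section can be reproduced with a simple harness along these lines (a sketch; the exact measurement setup is not shown):

import time

start = time.perf_counter()
print(estimate_pi(100_000_000))
print(f"{time.perf_counter() - start:.2f}s")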
Multi-threaded: no improvement
from queue import Queue
from threading import Thread

def estimate_pi_with_threads(n, thread_num):
    queue = Queue()
    # Snapshot the per-thread workload before starting any threads: the
    # closure must never read a variable the main thread later reassigns.
    samples_per_thread = n // thread_num

    def thread_routine():
        queue.put(sample(samples_per_thread))

    threads = [Thread(target=thread_routine) for _ in range(thread_num)]
    for thread in threads:
        thread.start()

    total_n, total_in_sectors = 0, 0
    for _ in threads:
        part_n, part_in_sectors = queue.get()  # blocks until a result arrives
        total_n += part_n
        total_in_sectors += part_in_sectors

    for thread in threads:
        thread.join()
    return eval_pi(total_n, total_in_sectors)

# 100 million trials: 4.63 seconds — no improvement ❌
The GIL forces threads to alternate. Total computation time is the same.
Multi-process: real parallelism ✅
Each Python process runs its own independent VM instance — no GIL contention between processes. Different processes can run truly in parallel across multiple CPU cores.
from multiprocessing import Process, Queue as ProcessQueue

def estimate_pi_with_processes(n, process_num):
    queue = ProcessQueue()
    samples_per_process = n // process_num

    # Note: a nested function only works as a Process target under the
    # "fork" start method (the default on Linux); under "spawn" (Windows,
    # macOS), move process_routine to module level so it can be pickled.
    def process_routine():
        queue.put(sample(samples_per_process))

    processes = [Process(target=process_routine) for _ in range(process_num)]
    for process in processes:
        process.start()

    total_n, total_in_sectors = 0, 0
    for _ in processes:
        part_n, part_in_sectors = queue.get()
        total_n += part_n
        total_in_sectors += part_in_sectors

    for process in processes:
        process.join()
    return eval_pi(total_n, total_in_sectors)

# 100 million trials: 2.56 seconds — nearly 2x faster! ✅
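For reference, the same fan-out reads more compactly with the standard-library Pool; a sketch that assumes sample stays a module-level function (so it can be pickled under any start method):

from multiprocessing import Pool

def estimate_pi_with_pool(n, process_num):
    with Pool(process_num) as pool:
        # Each worker runs sample() on an equal share of the trials.
        results = pool.map(sample, [n // process_num] * process_num)
    total_n = sum(part_n for part_n, _ in results)
    total_in_sectors = sum(part_in for _, part_in in results)
    return eval_pi(total_n, total_in_sectors)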
Benchmark Summary
| Mode | Time (100M trials) | Notes |
|---|---|---|
| Single-threaded | 4.59s | Baseline |
| Multi-threaded (4 threads) | 4.63s | GIL cancels out parallelism |
| Multi-process (2 processes) | 2.56s | ~1.8x speedup ✅ |
⚠️ Note: Adding more processes beyond the number of CPU cores provides no further benefit.
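You can query the core count from Python when sizing the pool:

import os

print(os.cpu_count())  # e.g. 8 → diminishing returns beyond 8 processes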