DEV Community

James Lee

Python GIL: Why One Lock Rules the Entire Interpreter

The Origin of the GIL

Let's start with a question: are Python's built-in objects like list and dict thread-safe?

At the Python level, yes: individual operations on list, dict, and the other built-in containers are thread-safe. That's common knowledge. But when we study their source code, we find no mutex locks anywhere. That's a bit surprising.
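A quick sanity check: hammering one list with `append` from several threads, with no explicit lock, loses nothing (the thread and iteration counts here are arbitrary):

```python
from threading import Thread

lst = []

def worker():
    # list.append is a single C-level call, so it is atomic under the GIL
    for _ in range(10_000):
        lst.append(1)

threads = [Thread(target=worker) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(lst))  # 40000: every append survived
</imports>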

Take the append method of list as an example:

static int
app1(PyListObject *self, PyObject *v)
{
    Py_ssize_t n = PyList_GET_SIZE(self);

    assert (v != NULL);
    if (n == PY_SSIZE_T_MAX) {
        PyErr_SetString(PyExc_OverflowError,
            "cannot add more objects to list");
        return -1;
    }

    if (list_resize(self, n+1) < 0)   // Step 1: expand length by 1
        return -1;

    Py_INCREF(v);
    PyList_SET_ITEM(self, n, v);       // Step 2: set the new element
    return 0;
}

append is not an atomic operation. Imagine thread A calls append, expands the list length, then gets context-switched out. Thread B wakes up and accesses l[-1] — the last element. Since the length was already expanded but the element hasn't been set yet, thread B gets an invalid object!

Thread A: list_resize(n+1)  ← expands length
     ↓ context switch!
Thread B: access l[-1]      ← gets uninitialized memory ❌
     ↓ context switch back
Thread A: PyList_SET_ITEM   ← too late
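The same hazard exists one level up: a thread can only be switched out between bytecodes, so any Python statement that compiles to several bytecodes can be interleaved. Disassembling a hypothetical increment function makes the switch points visible:

```python
import dis

counter = 0

def incr():
    global counter
    counter += 1  # one statement, but several bytecodes

# The disassembly shows separate load / add / store instructions;
# a thread switch between the load and the store loses an update.
dis.dis(incr)
```

This is why compound operations like `counter += 1` still need a `threading.Lock` even though each individual bytecode runs under the GIL. (Exact opcode names vary between CPython versions.)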

This kind of race condition caused by concurrent access is normally solved with a mutex lock — one lock per object:

static int
some_op(PyListObject *self, ...)
{
    lock(self);    // acquire lock
    // ... operation ...
    unlock(self);  // release lock
}

But we don't see this in CPython's source. So what's going on?

Enter the GIL

Python uses a different approach: the Global Interpreter Lock (GIL) — a single global lock for the entire interpreter. Any Python thread that wants to execute bytecode must first acquire the GIL. At any given moment, only one thread runs inside the VM.

How does the VM alternate between threads? Python borrows the time-slice idea from OS process scheduling, except the interpreter meters execution itself. In older CPython, the running thread counted executed bytecodes and, after a fixed check interval (100 by default, tunable with sys.setcheckinterval()), voluntarily released the GIL so waiting threads could run. Since Python 3.2 the check is time-based: a thread releases the GIL after holding it for the switch interval, 5 ms by default (see sys.getswitchinterval()).

Thread lifecycle under the GIL:

┌──────────────────────────────────────────────────────────┐
│                   Python VM (single slot)                │
│                                                          │
│  [Thread A: Running] ──bytecode limit──▶ releases GIL   │
│       ↓                                       ↓         │
│  [Thread A: Ready]              [Thread B: acquires GIL]│
│                                       ↓                 │
│                              [Thread B: Running]        │
└──────────────────────────────────────────────────────────┘

Thread states:
┌──────────────┬────────────────────────────────────────────┐
│  Running     │ Currently executing bytecode (holds GIL)   │
│  Ready       │ Waiting to acquire the GIL                 │
│  IO Blocked  │ Blocked on a system call, GIL released     │
└──────────────┴────────────────────────────────────────────┘
  • Running → Ready: bytecode limit reached, releases GIL voluntarily
  • Running → IO Blocked: about to make a blocking syscall, releases GIL first
  • IO Blocked → Ready: syscall returns, waits to re-acquire GIL
  • Ready → Running: acquires GIL, starts executing
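In modern CPython (3.2+) the handoff is governed by a time-based switch interval rather than a bytecode count, and it can be inspected and tuned at runtime:

```python
import sys

# Current GIL switch interval in seconds (CPython default: 0.005 = 5 ms)
print(sys.getswitchinterval())

# A larger interval means fewer GIL handoffs (less switching overhead,
# but worse latency for waiting threads); a smaller one, the reverse.
sys.setswitchinterval(0.01)
print(sys.getswitchinterval())  # 0.01
```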

Why GIL Instead of Per-Object Locks?

| Approach | Advantage | Disadvantage |
|---|---|---|
| Per-object locks | Threads can run in parallel | Huge lock/unlock overhead |
| GIL | No frequent lock/unlock needed | Only one thread runs at a time |

Early experiments with removing the GIL in favor of per-object locks found that the parallelism gained was almost entirely cancelled out by the locking overhead, especially for heavily used objects like dict, while single-threaded code became noticeably slower. Since the common case is single-threaded, the GIL has a clear advantage there, and the Python authors' choice makes sense.


Impact of the GIL

Under the GIL, only one thread runs in the VM at any moment. Does this mean multithreading is completely useless for performance? It depends on the program's characteristics.

Programs generally operate in two states:

  • Runnable: either Running or Ready — competing for CPU resources
  • Blocked: waiting for I/O — yielding CPU resources

Based on the ratio of time spent in each state, programs fall into two categories:

  • CPU-bound: most time spent in Running state
  • I/O-bound: most time spent in IO Blocked state

I/O-Bound Programs

Single thread timeline:
[=run=][~~~IO wait~~~][=run=][~~~IO wait~~~][=run=]

Multi-thread timeline (GIL released during IO):
Thread A: [=run=][~~~IO~~~~~~~~~~~][=run=]
Thread B:        [=run=][~~~IO~~~][=run=][=run=]
                  ↑ Thread B runs while A waits for IO

I/O-bound programs are relatively unaffected by the GIL. When a thread is waiting for I/O, it releases the GIL, allowing other threads to run.

A classic example: a multi-threaded web crawler. Most time is spent waiting for server responses. A multi-threaded crawler can dramatically reduce total runtime.
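The overlap is easy to simulate: `time.sleep` stands in for blocking I/O and releases the GIL, so four 0.2-second waits finish in roughly 0.2 seconds rather than 0.8 (the durations and thread count are arbitrary):

```python
import time
from threading import Thread

def io_task():
    time.sleep(0.2)  # stand-in for a blocking syscall; the GIL is released

start = time.perf_counter()
threads = [Thread(target=io_task) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.perf_counter() - start

print(f"{elapsed:.2f}s")  # roughly 0.2s, not 0.8s: the waits overlap
```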

CPU-Bound Programs

Single thread timeline:
[============================== compute ==============================]

Multi-thread timeline (GIL forces alternation):
Thread A: [=run=][wait][=run=][wait][=run=][wait]
Thread B: [wait][=run=][wait][=run=][wait][=run=]
Combined: same total time, no speedup ❌

CPU-bound programs are heavily impacted by the GIL. Python threads cannot execute in parallel across multiple cores — multithreading provides no speedup for CPU-bound work.


How to Achieve True Multi-Core Execution

Let's use Monte Carlo π estimation — a classic CPU-bound problem — to explore Python's multi-core options.

The Algorithm

Draw a unit square with an inscribed quarter-circle (radius = 1):

(0,1)
  │   ╭──────
  │  ╱ sector
  │ ╱
  │╱
  └──────── (1,0)

Square area  = 1
Sector area  = π/4

P(point lands in sector) = π/4
→ π ≈ 4 × (points in sector) / (total points)

The more trials, the more accurate the estimate — and the more computation required.

Implementation

import random

def sample(n):
    in_sectors = 0
    for _ in range(n):
        x, y = random.random(), random.random()
        if x*x + y*y < 1:
            in_sectors += 1
    return n, in_sectors

def eval_pi(n, in_sectors):
    return 4. * in_sectors / n

Single-threaded: baseline

def estimate_pi(n):
    return eval_pi(*sample(n))

# 100 million trials: 4.59 seconds

Multi-threaded: no improvement

from queue import Queue
from threading import Thread

def estimate_pi_with_threads(n, thread_num):
    queue = Queue()

    def thread_routine():
        queue.put(sample(n // thread_num))

    threads = [Thread(target=thread_routine) for _ in range(thread_num)]

    for thread in threads:
        thread.start()

    total_n, total_in_sectors = 0, 0
    for _ in threads:
        sub_n, sub_in_sectors = queue.get()  # avoid shadowing the n parameter
        total_n += sub_n
        total_in_sectors += sub_in_sectors

    for thread in threads:
        thread.join()

    return eval_pi(total_n, total_in_sectors)

# 100 million trials: 4.63 seconds — no improvement ❌

The GIL forces threads to alternate. Total computation time is the same.

Multi-process: real parallelism ✅

Each Python process runs its own independent VM instance — no GIL contention between processes. Different processes can run truly in parallel across multiple CPU cores.

from multiprocessing import Process, Queue as ProcessQueue

def estimate_pi_with_processes(n, process_num):
    queue = ProcessQueue()

    def process_routine():
        # NOTE: a local function as a Process target only works with the
        # "fork" start method (the Linux default); "spawn" (Windows, and
        # macOS since 3.8) requires a picklable top-level function.
        queue.put(sample(n // process_num))

    processes = [Process(target=process_routine) for _ in range(process_num)]

    for process in processes:
        process.start()

    total_n, total_in_sectors = 0, 0
    for _ in processes:
        sub_n, sub_in_sectors = queue.get()  # avoid shadowing the n parameter
        total_n += sub_n
        total_in_sectors += sub_in_sectors

    for process in processes:
        process.join()

    return eval_pi(total_n, total_in_sectors)

# 100 million trials: 2.56 seconds — nearly 2x faster! ✅

Benchmark Summary

| Mode | Time (100M trials) | Notes |
|---|---|---|
| Single-threaded | 4.59 s | Baseline |
| Multi-threaded (4 threads) | 4.63 s | GIL cancels out parallelism |
| Multi-process (2 processes) | 2.56 s | ~1.8x speedup ✅ |

⚠️ Note: Adding more processes beyond the number of CPU cores provides no further benefit.
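A variant worth sketching: the standard library's concurrent.futures.ProcessPoolExecutor handles worker creation and result collection for you. Because `sample` here is a top-level function it is picklable, so this version also works under the "spawn" start method. (The pool-based wrapper `estimate_pi_with_pool` is an illustration, not the article's code.)

```python
import os
import random
from concurrent.futures import ProcessPoolExecutor

def sample(n):
    # Same Monte Carlo kernel as before, returning only the hit count
    in_sectors = 0
    for _ in range(n):
        x, y = random.random(), random.random()
        if x * x + y * y < 1:
            in_sectors += 1
    return in_sectors

def estimate_pi_with_pool(n, workers=None):
    workers = workers or os.cpu_count()
    chunk = n // workers  # each worker runs an equal share of the trials
    with ProcessPoolExecutor(max_workers=workers) as pool:
        hits = sum(pool.map(sample, [chunk] * workers))
    return 4.0 * hits / (chunk * workers)

if __name__ == "__main__":
    print(estimate_pi_with_pool(1_000_000))  # ≈ 3.14
```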
