🧠 A bit of context: the GIL, what is it and what is it for?
The GIL (Global Interpreter Lock) is a mechanism that has been part of the CPython implementation since its earliest days. It ensures that only one thread can execute Python bytecode at a time, which has historically greatly simplified the interpreter's internal memory management (reference counting in particular).
Nevertheless, this global lock has a major downside: multi-threaded programs do not scale across multiple cores for CPU-bound workloads.
➡️ In practice, `threading.Thread` is therefore mainly useful for I/O-bound workloads, where threads spend their time waiting rather than computing.
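To illustrate (a minimal sketch, not part of my benchmark suite): threads that block on I/O release the GIL while they wait, so even the classic build overlaps them.

```python
import threading
import time

def fake_io(delay):
    time.sleep(delay)  # blocking calls like this release the GIL while waiting

start = time.time()
threads = [threading.Thread(target=fake_io, args=(1,)) for _ in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Roughly 1 s total even with the GIL on: the five waits overlap
print(f"Elapsed: {time.time() - start:.2f} s")
```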
For years, removing the GIL has been a recurring topic in the Python community. Since Python 3.13, it is finally becoming a reality!
🧪 Python 3.13 introduces an experimental mode: `--disable-gil`
This is the subject of PEP 703 (Making the Global Interpreter Lock Optional), proposed by Sam Gross, which generated a lot of discussion. Since Python 3.13, the GIL can be disabled: for this, the Python interpreter must be compiled with the `--disable-gil` option. This `nogil` (free-threaded) build has been further improved in Python 3.14.

With this mode enabled at compilation, the GIL can be controlled via the environment variable `PYTHON_GIL=0|1`.
For example, to run a script in `nogil` mode, simply do:

```bash
PYTHON_GIL=0 python3 my_script.py
```
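To double-check which mode you are actually running in, CPython exposes a couple of introspection hooks (a small sketch; note that `sys._is_gil_enabled()` is a CPython implementation detail added in 3.13):

```python
import sys
import sysconfig

# 1 on free-threaded builds, 0 (or None) on classic builds
print("Free-threaded build:", sysconfig.get_config_var("Py_GIL_DISABLED"))

# Added in CPython 3.13; reflects the runtime state (PYTHON_GIL, etc.)
if hasattr(sys, "_is_gil_enabled"):
    print("GIL currently enabled:", sys._is_gil_enabled())
```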
Some benchmarks exist online, for example this one. But I wanted to dig a little deeper and evaluate the potential of this revolution!
🧠 Setup for the tests
To run my tests, I created a Docker image that compiles Python 3.14b2 with the `--disable-gil` option. The Dockerfile and all test cases are available here.
To build the Docker image:

```bash
$ docker build . -f Dockerfile-nogil -t python:nogil_3.14b2
```
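And to run one of the test scripts inside the container, something like this works (the script name is just an example, and I assume the image leaves `python3` on the `PATH`):

```bash
# Same image, with and without the GIL
docker run --rm -v "$PWD:/work" -w /work python:nogil_3.14b2 python3 bench.py
docker run --rm -e PYTHON_GIL=0 -v "$PWD:/work" -w /work python:nogil_3.14b2 python3 bench.py
```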
⚙️ Benchmark #1: Heavy factorial in multiple threads
For the first test, we simply run a big factorial computation in several threads.
```python
import threading
import math
import time

def test_function(num_thread):
    thread_start_time = time.time()
    math.factorial(250000)  # pure CPU-bound work
    thread_execution_time = time.time() - thread_start_time

start_time = time.time()
threads = []
for num_thread in range(5):
    thread = threading.Thread(target=test_function, args=(num_thread,))
    thread.start()
    threads.append(thread)

for thread in threads:
    thread.join()

execution_time = time.time() - start_time
print(f" Elapsed time: {execution_time:.2f} seconds")
```
🧾 Results
Mode | Total time |
---|---|
GIL ON | 1.97 s |
GIL OFF | 0.56 s |
✅ Gain of ×3~4: here, `--disable-gil` allows real parallelization of the heavy computation, with very good scaling.
⚙️ Benchmark #2: Moderate mathematical calculations
Same idea as before, but with a slightly more varied math load.
```python
import threading
import math
import time
import random

def stress_function(thread_id, complexity):
    result = 0
    for i in range(1, complexity):
        a = math.sqrt(i) + math.sin(i) ** 2
        b = math.log1p(i) * math.exp(-a)
        c = math.factorial(i % 500 + 500) % 10**8
        result += a * b + c

def run_threads(num_threads=4, complexity=10_000):
    print(f"Launching {num_threads} threads with complexity {complexity}")
    threads = []
    start_time = time.time()
    for i in range(num_threads):
        t = threading.Thread(target=stress_function, args=(i, complexity))
        t.start()
        threads.append(t)
    for t in threads:
        t.join()
    total_time = time.time() - start_time
    print(f" Elapsed time {total_time:.2f} seconds")

if __name__ == "__main__":
    run_threads(num_threads=5, complexity=20_000)
```
🧾 Results
Mode | Total time |
---|---|
GIL ON | 1.36 s |
GIL OFF | 0.41 s |
✅ Gain of ×3~4 again, consistent scaling.
⚙️ Benchmark #3: Shared concurrent access
Here we add writes to a shared variable (a Python list). Each thread writes to a distinct slot.
```python
import threading
import math
import time

def stress_function(thread_id, complexity, shared_data, lock):
    local_result = 0.0
    for i in range(1, complexity):
        a = math.sqrt(i) + math.sin(i) ** 2
        b = math.log1p(i) * math.exp(-a)
        c = math.factorial(i % 500 + 500) % 10**8
        local_result += a * b + c
    # Each thread writes to its own slot, so this write needs no lock
    shared_data[thread_id] = local_result

def run_threads(num_threads=4, complexity=10_000):
    threads = []
    shared_data = [0.0] * num_threads
    lock = threading.Lock()  # passed along, but never contended here
    start_time = time.time()
    for i in range(num_threads):
        t = threading.Thread(target=stress_function, args=(i, complexity, shared_data, lock))
        t.start()
        threads.append(t)
    for t in threads:
        t.join()
    total_time = time.time() - start_time
    total_sum = sum(shared_data)
    print(f" Elapsed time: {total_time:.2f} seconds")
    print(f" -> Global result: {total_sum % 1000:.2f}")
```
🧾 Results
Mode | Total time |
---|---|
GIL ON | 1.32 s |
GIL OFF | 0.43 s |
😎 Thanks to a clean design (no thread collisions: each thread owns its slot), `nogil` remains very efficient here.
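For contrast, here is a sketch (not one of my benchmarks) of what a genuinely contended version would look like: every thread updates the same accumulator under the lock, so that part of the work serializes no matter what the GIL does.

```python
import threading
import math

total = 0.0
lock = threading.Lock()

def contended_worker(complexity):
    global total
    for i in range(1, complexity):
        value = math.sqrt(i) + math.sin(i) ** 2
        with lock:          # every iteration competes for the same lock
            total += value  # single shared accumulator: threads serialize here

threads = [threading.Thread(target=contended_worker, args=(20_000,)) for _ in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(f"total = {total:.2f}")
```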
⚠️ Benchmark #4: Nearest neighbor search on 10M points
More realistic test: 10 million random 3D points, nearest neighbor search using multi-threading with a shared dictionary.
```python
import threading
import random
import time
from math import sqrt

def distance2(a, b):
    # Squared Euclidean distance (no sqrt in the hot loop)
    return (a[0]-b[0])**2 + (a[1]-b[1])**2 + (a[2]-b[2])**2

def find_closest_worker(points, query_point, start, end, shared_result, lock, tid):
    local_min_dist = float("inf")
    local_min_idx = -1
    for i in range(start, end):
        d = distance2(points[i], query_point)
        if d < local_min_dist:
            local_min_dist = d
            local_min_idx = i
    # Merge the local minimum into the shared dict under the lock
    with lock:
        if local_min_dist < shared_result['min_dist']:
            shared_result['min_dist'] = local_min_dist
            shared_result['closest_idx'] = local_min_idx
            shared_result['owner'] = tid

def threaded_closest_point(points, query_point, num_threads=4):
    n = len(points)
    chunk_size = n // num_threads
    shared_result = {
        'min_dist': float("inf"),
        'closest_idx': -1,
        'owner': -1
    }
    lock = threading.Lock()
    threads = []
    for tid in range(num_threads):
        start = tid * chunk_size
        end = (tid + 1) * chunk_size if tid < num_threads - 1 else n
        t = threading.Thread(target=find_closest_worker,
                             args=(points, query_point, start, end, shared_result, lock, tid))
        t.start()
        threads.append(t)
    for t in threads:
        t.join()
    return shared_result['closest_idx'], sqrt(shared_result['min_dist']), shared_result['owner']

if __name__ == "__main__":
    N = 10_000_000
    num_threads = 4
    points = [[random.random(), random.random(), random.random()] for _ in range(N)]
    query_point = [random.random(), random.random(), random.random()]
    print(f"Launching threaded NN search with {num_threads} threads")
    t0 = time.time()
    idx, dist, owner = threaded_closest_point(points, query_point, num_threads)
    t1 = time.time()
    print(f" Elapsed time: {t1 - t0:.2f} seconds")
```
🧾 Results
Mode | Total time |
---|---|
GIL ON | 2.38 s |
GIL OFF | 3.61 s ❌ |
❌ And here, big surprise: enabling `nogil` makes things slower. I don't have a definitive explanation, but if I had to guess, I would say that concurrent access to the big shared Python list causes some limitations (in the free-threaded build, touching objects shared across threads carries extra reference-counting overhead).
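One way to test that guess (an assumption on my part, I have not benchmarked this variant) would be to hand each worker a private slice of the data, reusing `find_closest_worker` from above, so the hot loop never indexes into the big shared list:

```python
# Hypothetical variant: each thread gets its own copy of its chunk
# instead of reading from the shared 10M-point list.
def threaded_closest_point_chunked(points, query_point, num_threads=4):
    n = len(points)
    chunk_size = n // num_threads
    shared_result = {'min_dist': float("inf"), 'closest_idx': -1, 'owner': -1}
    lock = threading.Lock()
    threads = []
    for tid in range(num_threads):
        start = tid * chunk_size
        end = (tid + 1) * chunk_size if tid < num_threads - 1 else n
        chunk = points[start:end]  # private slice: no shared-list reads in the loop
        t = threading.Thread(target=find_closest_worker,
                             args=(chunk, query_point, 0, len(chunk),
                                   shared_result, lock, tid))
        t.start()
        threads.append(t)
    for t in threads:
        t.join()
    # Note: closest_idx is now chunk-local; the global index is
    # shared_result['closest_idx'] + shared_result['owner'] * chunk_size
    return shared_result
```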
🎁 Bonus: C++ version + nanobind
I also tested this case using native C++ multi-threading, exposed to Python via `nanobind`, mainly to see whether exactly the same logic on a different tech stack would yield different results. And yes:
Implementation | Total time |
---|---|
Python GIL ON | 2.38 s |
Python GIL OFF | 3.61 s ❌ |
C++ (threads) | 0.88 s ✅ |
In the C++ thread version we do get a big performance boost. So the logic is sound, or at least not too bad; the limitation comes from Python 😿.
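For completeness, the Python side of such a binding is a plain import and call (the names below are illustrative, not the actual module from my repo):

```python
# Hypothetical usage of a nanobind-compiled extension module
import nn_native  # illustrative name for the compiled extension

idx, dist, owner = nn_native.closest_point(points, query_point, num_threads=4)
print(idx, dist, owner)
```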
🏁 Conclusion
Python is improving, and that's undeniable and exciting. But for heavily concurrent cases, native implementations (C/C++) still hold a clear advantage.
- ✅ Python 3.14 with `--disable-gil` brings real gains for CPU-bound multi-threaded tasks
- ⚠️ But performance is not always better, especially with large volumes of data or memory contention