🧠 A bit of context: the GIL, what is it and what is it for?
The GIL (Global Interpreter Lock) is a mechanism that has been part of the CPython implementation since its earliest days. It ensures that only one thread can execute Python bytecode at a time, which has historically greatly simplified the interpreter's internal memory management (reference counting in particular).
Nevertheless, this global lock has a major downside: multi-threaded programs do not scale across multiple cores for CPU-bound workloads.
➡️ In practice, `threading.Thread` is therefore mainly useful for I/O-bound workloads, where threads spend their time waiting rather than computing.
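To illustrate (a minimal sketch, not part of my benchmark suite): threads that block on I/O release the GIL while they wait, so even the classic build overlaps them.

```python
import threading
import time

def fake_io(delay):
    time.sleep(delay)  # blocking calls like this release the GIL while waiting

start = time.time()
threads = [threading.Thread(target=fake_io, args=(1,)) for _ in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()

# Roughly 1 s total even with the GIL on: the five waits overlap
print(f"Elapsed: {time.time() - start:.2f} s")
```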
For years, removing the GIL has been a recurring topic in the Python community. Since Python 3.13, it is finally becoming a reality!
🧪 Python 3.13 introduces an experimental mode: `--disable-gil`
This is the subject of PEP 703 (Making the Global Interpreter Lock Optional), proposed by Sam Gross, which generated a lot of discussion. Since Python 3.13, the GIL can be disabled: for this, the Python interpreter must be compiled with the `--disable-gil` option. This `nogil` (free-threaded) build has been further improved in Python 3.14.

With this mode enabled at compilation, the GIL can be controlled via the environment variable `PYTHON_GIL=0|1`.
For example, to run a script in `nogil` mode, simply do:

```bash
PYTHON_GIL=0 python3 my_script.py
```
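To double-check which mode you are actually running in, CPython exposes a couple of introspection hooks (a small sketch; note that `sys._is_gil_enabled()` is a CPython implementation detail added in 3.13):

```python
import sys
import sysconfig

# 1 on free-threaded builds, 0 (or None) on classic builds
print("Free-threaded build:", sysconfig.get_config_var("Py_GIL_DISABLED"))

# Added in CPython 3.13; reflects the runtime state (PYTHON_GIL, etc.)
if hasattr(sys, "_is_gil_enabled"):
    print("GIL currently enabled:", sys._is_gil_enabled())
```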
Some benchmarks exist online, for example this one. But I wanted to dig a little deeper and evaluate the potential of this revolution!
🧠 Setup for the tests
To run my tests, I created a Docker image that compiles Python 3.14b2 with the `--disable-gil` option. The Dockerfile and all test cases are available here.
To build the Docker image:

```bash
$ docker build . -f Dockerfile-nogil -t python:nogil_3.14b2
```
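And to run one of the test scripts inside the container, something like this works (the script name is just an example, and I assume the image leaves `python3` on the `PATH`):

```bash
# Same image, with and without the GIL
docker run --rm -v "$PWD:/work" -w /work python:nogil_3.14b2 python3 bench.py
docker run --rm -e PYTHON_GIL=0 -v "$PWD:/work" -w /work python:nogil_3.14b2 python3 bench.py
```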
⚙️ Benchmark #1: Heavy factorial in multiple threads
For the first test, we simply run a big factorial computation in several threads.
```python
import threading
import math
import time

def test_function(num_thread):
    thread_start_time = time.time()
    math.factorial(250000)  # pure CPU-bound work
    thread_execution_time = time.time() - thread_start_time

start_time = time.time()
threads = []
for num_thread in range(5):
    thread = threading.Thread(target=test_function, args=(num_thread,))
    thread.start()
    threads.append(thread)

for thread in threads:
    thread.join()

execution_time = time.time() - start_time
print(f" Elapsed time: {execution_time:.2f} seconds")
```
🧾 Results
Mode | Total time |
---|---|
GIL ON | 1.97 s |
GIL OFF | 0.56 s |
✅ Gain of ×3~4: here, `--disable-gil` allows real parallelization of the heavy computation, with very good scaling.
⚙️ Benchmark #2: Moderate mathematical calculations
Same idea as before, but with a slightly more varied math load.
```python
import threading
import math
import time
import random

def stress_function(thread_id, complexity):
    result = 0
    for i in range(1, complexity):
        a = math.sqrt(i) + math.sin(i) ** 2
        b = math.log1p(i) * math.exp(-a)
        c = math.factorial(i % 500 + 500) % 10**8
        result += a * b + c

def run_threads(num_threads=4, complexity=10_000):
    print(f"Launching {num_threads} threads with complexity {complexity}")
    threads = []
    start_time = time.time()
    for i in range(num_threads):
        t = threading.Thread(target=stress_function, args=(i, complexity))
        t.start()
        threads.append(t)
    for t in threads:
        t.join()
    total_time = time.time() - start_time
    print(f" Elapsed time {total_time:.2f} seconds")

if __name__ == "__main__":
    run_threads(num_threads=5, complexity=20_000)
```
🧾 Results
Mode | Total time |
---|---|
GIL ON | 1.36 s |
GIL OFF | 0.41 s |
✅ Gain of ×3~4 again, consistent scaling.
⚙️ Benchmark #3: Shared concurrent access
Here we add writes to a shared variable (a Python list). Each thread writes to a distinct slot.
```python
import threading
import math
import time

def stress_function(thread_id, complexity, shared_data, lock):
    local_result = 0.0
    for i in range(1, complexity):
        a = math.sqrt(i) + math.sin(i) ** 2
        b = math.log1p(i) * math.exp(-a)
        c = math.factorial(i % 500 + 500) % 10**8
        local_result += a * b + c
    # Each thread writes to its own slot, so this write needs no lock
    shared_data[thread_id] = local_result

def run_threads(num_threads=4, complexity=10_000):
    threads = []
    shared_data = [0.0] * num_threads
    lock = threading.Lock()  # passed along, but never contended here
    start_time = time.time()
    for i in range(num_threads):
        t = threading.Thread(target=stress_function, args=(i, complexity, shared_data, lock))
        t.start()
        threads.append(t)
    for t in threads:
        t.join()
    total_time = time.time() - start_time
    total_sum = sum(shared_data)
    print(f" Elapsed time: {total_time:.2f} seconds")
    print(f" -> Global result: {total_sum % 1000:.2f}")
```
🧾 Results
Mode | Total time |
---|---|
GIL ON | 1.32 s |
GIL OFF | 0.43 s |
😎 Thanks to a clean design (no thread collisions: each thread owns its slot), `nogil` remains very efficient here.
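For contrast, here is a sketch (not one of my benchmarks) of what a genuinely contended version would look like: every thread updates the same accumulator under the lock, so that part of the work serializes no matter what the GIL does.

```python
import threading
import math

total = 0.0
lock = threading.Lock()

def contended_worker(complexity):
    global total
    for i in range(1, complexity):
        value = math.sqrt(i) + math.sin(i) ** 2
        with lock:          # every iteration competes for the same lock
            total += value  # single shared accumulator: threads serialize here

threads = [threading.Thread(target=contended_worker, args=(20_000,)) for _ in range(5)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(f"total = {total:.2f}")
```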
⚠️ Benchmark #4: Nearest neighbor search on 10M points
More realistic test: 10 million random 3D points, nearest neighbor search using multi-threading with a shared dictionary.
```python
import threading
import random
import time
from math import sqrt

def distance2(a, b):
    # Squared Euclidean distance (no sqrt in the hot loop)
    return (a[0]-b[0])**2 + (a[1]-b[1])**2 + (a[2]-b[2])**2

def find_closest_worker(points, query_point, start, end, shared_result, lock, tid):
    local_min_dist = float("inf")
    local_min_idx = -1
    for i in range(start, end):
        d = distance2(points[i], query_point)
        if d < local_min_dist:
            local_min_dist = d
            local_min_idx = i
    # Merge the local minimum into the shared dict under the lock
    with lock:
        if local_min_dist < shared_result['min_dist']:
            shared_result['min_dist'] = local_min_dist
            shared_result['closest_idx'] = local_min_idx
            shared_result['owner'] = tid

def threaded_closest_point(points, query_point, num_threads=4):
    n = len(points)
    chunk_size = n // num_threads
    shared_result = {
        'min_dist': float("inf"),
        'closest_idx': -1,
        'owner': -1
    }
    lock = threading.Lock()
    threads = []
    for tid in range(num_threads):
        start = tid * chunk_size
        end = (tid + 1) * chunk_size if tid < num_threads - 1 else n
        t = threading.Thread(target=find_closest_worker,
                             args=(points, query_point, start, end, shared_result, lock, tid))
        t.start()
        threads.append(t)
    for t in threads:
        t.join()
    return shared_result['closest_idx'], sqrt(shared_result['min_dist']), shared_result['owner']

if __name__ == "__main__":
    N = 10_000_000
    num_threads = 4
    points = [[random.random(), random.random(), random.random()] for _ in range(N)]
    query_point = [random.random(), random.random(), random.random()]
    print(f"Launching threaded NN search with {num_threads} threads")
    t0 = time.time()
    idx, dist, owner = threaded_closest_point(points, query_point, num_threads)
    t1 = time.time()
    print(f" Elapsed time: {t1 - t0:.2f} seconds")
```
🧾 Results
Mode | Total time |
---|---|
GIL ON | 2.38 s |
GIL OFF | 3.61 s ❌ |
❌ And here, big surprise: enabling `nogil` makes things slower. I don't have a definitive explanation, but if I had to guess, I would say that concurrent access to the big shared Python list causes some limitations (in the free-threaded build, touching objects shared across threads carries extra reference-counting overhead).
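One way to test that guess (an assumption on my part, I have not benchmarked this variant) would be to hand each worker a private slice of the data, reusing `find_closest_worker` from above, so the hot loop never indexes into the big shared list:

```python
# Hypothetical variant: each thread gets its own copy of its chunk
# instead of reading from the shared 10M-point list.
def threaded_closest_point_chunked(points, query_point, num_threads=4):
    n = len(points)
    chunk_size = n // num_threads
    shared_result = {'min_dist': float("inf"), 'closest_idx': -1, 'owner': -1}
    lock = threading.Lock()
    threads = []
    for tid in range(num_threads):
        start = tid * chunk_size
        end = (tid + 1) * chunk_size if tid < num_threads - 1 else n
        chunk = points[start:end]  # private slice: no shared-list reads in the loop
        t = threading.Thread(target=find_closest_worker,
                             args=(chunk, query_point, 0, len(chunk),
                                   shared_result, lock, tid))
        t.start()
        threads.append(t)
    for t in threads:
        t.join()
    # Note: closest_idx is now chunk-local; the global index is
    # shared_result['closest_idx'] + shared_result['owner'] * chunk_size
    return shared_result
```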
🎁 Bonus: C++ version + nanobind
I also tested this case using native C++ multi-threading, exposed to Python via `nanobind`, mainly to see whether exactly the same logic on a different tech stack would yield different results. And yes:
Implementation | Total time |
---|---|
Python GIL ON | 2.38 s |
Python GIL OFF | 3.61 s ❌ |
C++ (threads) | 0.88 s ✅ |
In the C++ thread version we do get a big performance boost. So the logic is sound, or at least not too bad; the limitation comes from Python 😿.
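For completeness, the Python side of such a binding is a plain import and call (the names below are illustrative, not the actual module from my repo):

```python
# Hypothetical usage of a nanobind-compiled extension module
import nn_native  # illustrative name for the compiled extension

idx, dist, owner = nn_native.closest_point(points, query_point, num_threads=4)
print(idx, dist, owner)
```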
🏁 Conclusion
Python is improving, and that's undeniable and exciting. But for heavily concurrent cases, native implementations (C/C++) still hold a clear advantage.
- ✅ Python 3.14 with `--disable-gil` brings real gains for CPU-bound multi-threaded tasks
- ⚠️ But performance is not always better, especially with large volumes of data or memory contention