Some time ago I was chatting with a friend about programming languages, and the conversation drifted — inevitably — to why Python is “bad.”
The first argument was that Python has “no types,” which makes it error-prone.
I pushed back: Python’s modern type system is surprisingly expressive, and meanwhile Java manages to produce endless null-pointer exceptions despite all its ceremony. That point didn’t land well.
So the next argument came out:
“Anyway, Python can’t even use more than one CPU core because of the GIL.”
I didn’t try to debate it.
Instead, I opened a terminal, ran a data-processing script I had lying around, and showed them top.
One Python process was using exactly 300% CPU. They did not believe it.
And that moment captures something important: the GIL is one of the most confidently misunderstood ideas in all of programming.
Why the GIL Seems Simple but Isn’t
What makes this even trickier is that the GIL actually exists to make Python simple.
It lets most of the runtime behave like a friendly, high-level environment where you never have to think about memory management, object lifetimes, or thread safety inside the interpreter.
But concurrency is where that abstraction finally hits its boundary.
The GIL doesn’t behave the same way in every situation — it takes different forms under different workloads.
That’s why so many explanations are technically correct yet still misleading.
So let’s peel it back one layer at a time.
Layer 1: “Threads Are Useless in Python”
A common starting point is the idea that Python threads are “useless” because of the GIL.
If that were true, CPython wouldn’t bother with real OS threads — it could have simulated concurrency with simple user-level green threads.
But CPython uses real pthreads for a reason: threads matter in Python.
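Here's a minimal sketch of why, using nothing beyond the standard library: ten threads blocked on I/O (simulated with time.sleep) finish in about one second of wall time, not ten, because CPython releases the GIL whenever a thread blocks.

```python
import threading
import time

def blocking_task():
    # Stand-in for a network call or disk read; CPython releases
    # the GIL while a thread sleeps or waits on I/O.
    time.sleep(1)

start = time.perf_counter()
threads = [threading.Thread(target=blocking_task) for _ in range(10)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(f"10 blocking tasks took {time.perf_counter() - start:.2f}s")  # ~1s, not ~10s
```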
Layer 2: “Threads Only Help for I/O”
A slightly more sophisticated version of this is:
“Threads are only useful when something blocks on I/O. Otherwise the GIL stops everything.”
This sounds reasonable — but it’s still not right.
And we already saw a counterexample at the very beginning: our Python process was happily using 300% CPU, with no I/O involved at all.
So clearly something else is going on.
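For the curious, here's a rough reconstruction of what that terminal was doing (the original script isn't shown, so treat this as an illustrative stand-in): CPython's hashlib releases the GIL while digesting buffers larger than a couple of kilobytes, so three threads hashing large buffers really do run on three cores.

```python
import hashlib
import threading

data = b"x" * (64 * 1024 * 1024)  # 64 MB buffer per hash call

def hash_loop():
    for _ in range(20):
        # hashlib drops the GIL while the C hashing loop runs,
        # so these threads execute on separate cores.
        hashlib.sha256(data).digest()

threads = [threading.Thread(target=hash_loop) for _ in range(3)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# While this runs, `top` shows a single Python process near 300% CPU.
```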
Layer 3: “Threads Help for I/O or Native Code — So Problem Solved?”
Once people realize threads aren’t limited to I/O, they usually expand the model:
“Okay, fine — if a thread is blocked on I/O or running C code, another thread can run. That’s the whole story.”
Closer, but still not quite right, because here’s another catch:
Not all C libraries release the GIL. Some explicitly request it.
And when they do, the interpreter goes right back to single-threaded behavior, even though no Python code is executing.
This is why you can see one native operation scale beautifully across cores, while another operation — also written in C — completely freezes out every other thread in the process.
Layer 4: Why Some C Libraries Must Hold the GIL
So, the “I/O or C computation frees another thread” intuition is still incomplete.
It depends entirely on what the C library is doing under the hood... and whether it needs exclusive access to Python objects or interpreter state.
Every Python object — even something as small as an int or a bytes object — carries a reference count that tracks how many places are using it.
Incrementing and decrementing these counts must happen one at a time.
If two threads updated them independently, you’d get leaked objects, prematurely freed objects, or corrupted memory.
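You can watch these counts from Python itself with sys.getrefcount (the value it reports is one higher than you might expect, because the function argument temporarily holds a reference of its own):

```python
import sys

x = []
print(sys.getrefcount(x))  # 2: the name `x` plus the function argument
y = x
print(sys.getrefcount(x))  # 3: binding `y` added another reference
```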
But refcounts are only the beginning.
Any C code that:
- creates or destroys Python objects
- mutates a Python object (appending to a list, updating a dict, modifying a set)
- raises exceptions
- calls back into Python code
…relies on parts of the interpreter that assume exclusive access.
And the only way to guarantee that is to hold the GIL.
This is why one C extension can run happily across multiple cores while another — also written in C — blocks every thread in the process.
It depends entirely on whether that extension needs to interact with Python objects or whether it can work on its own data structures without consulting the interpreter.
Layer 5: Why “Just Use Multiprocessing” Is an Oversimplification
When people get frustrated with the GIL, the next instinct is usually: “Okay, forget threads. Just use multiprocessing.”
On paper it sounds perfect — each process has its own GIL, so you get true parallelism, and sometimes that is the right approach.
But as a general rule, it’s still an oversimplification.
Multiprocessing has real costs:
- Most data still has to be serialized (pickled) to move between processes; shared memory works only for a limited set of types
- Even with copy-on-write on Unix, large Python objects often end up duplicated anyway — refcount updates alone are enough to break CoW
- Starting a process has much higher overhead than starting a thread (a new interpreter instance, new memory space, new OS structures — not just a new stack)
- Context switching between processes is heavier for the OS than between threads
- Coordinating state is harder when it can’t be freely shared
So yes — multiprocessing sidesteps the GIL for CPU-bound Python bytecode.
But it brings fully separate runtimes, higher startup costs, heavier context switches, and more complicated data sharing.
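Here's a small sketch of the serialization tax, deliberately using a task so tiny that pickling dominates. With work this small, the process pool can easily lose to a plain loop.

```python
import multiprocessing as mp
import time

def square(n):
    return n * n

if __name__ == "__main__":
    numbers = list(range(1_000_000))

    start = time.perf_counter()
    serial = [square(n) for n in numbers]
    print(f"serial loop:  {time.perf_counter() - start:.2f}s")

    # Every item is pickled to a worker process and every result is
    # pickled back, so IPC overhead dwarfs the trivial computation.
    start = time.perf_counter()
    with mp.Pool() as pool:
        parallel = pool.map(square, numbers)
    print(f"process pool: {time.perf_counter() - start:.2f}s")
```

For chunky, CPU-heavy tasks the trade flips, which is exactly why multiprocessing is sometimes the right answer and sometimes pure overhead.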
How to Tell Whether a C Library Releases the GIL
At this point a natural question comes up:
“So how do I know whether a C library actually releases the GIL?”
Unfortunately, you can’t always tell from the outside — two libraries that look identical in Python can have completely different GIL behavior.
But there are a few practical ways to reason about it.
1. If a C extension touches Python objects, it must hold the GIL — but only for those parts
This doesn’t mean it needs the GIL for the entire function.
A well-written C extension typically does this:
- Acquire GIL
- Inspect or convert Python inputs (refcounts, type checks, copies, etc.)
- Release GIL
- Run the heavy native computation
- Reacquire GIL
- Build Python output objects
- Return
So:
- touching Python objects → requires the GIL
- heavy inner computation → often does not
This is why you can see “C code + threads” giving real multi-core speedups despite Python objects being involved at the boundaries.
Some of the standard library’s own C extensions follow this pattern.
For example, zlib, bz2, and hashlib all parse Python arguments and allocate Python objects while holding the GIL, then drop the GIL around the inner compression or hashing loop, and reacquire it only to wrap up the result.
2. Pure native computation usually releases the GIL cleanly
Libraries like:
- zlib / bz2 / lzma
- hashing
- crypto
- NumPy (when you do array1 + array2, NumPy releases the GIL around the actual arithmetic loop; but when you do something like array.tolist(), it can't, because it's creating Python objects)
- many image codecs
operate on raw buffers or on their own internal data structures.
They typically wrap the expensive part with:
```c
Py_BEGIN_ALLOW_THREADS
/* heavy computation */
Py_END_ALLOW_THREADS
```
This is how you get things like “Python using 300% CPU” from a single process.
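As a sketch of the NumPy case mentioned above (assuming a standard NumPy build, which releases the GIL around its element-wise loops), four threads doing array arithmetic can overlap on separate cores:

```python
import threading
import numpy as np

a = np.random.rand(2000, 2000)
b = np.random.rand(2000, 2000)

def arithmetic_loop():
    for _ in range(50):
        _ = a + b  # NumPy drops the GIL around the element-wise loop

threads = [threading.Thread(target=arithmetic_loop) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# Swap `a + b` for `a.tolist()` and the threads collapse onto one core:
# building millions of Python float objects requires holding the GIL.
```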
3. Some native libraries can’t release the GIL much — because their core work is Python interaction
Examples:
- regex engines working on Python string objects (the standard re module doesn't release the GIL)
- Python-level parsing loops (e.g. json)
- libraries that mutate lists or dicts internally (e.g. py-radix)
- things that allocate many intermediate Python objects (e.g. csv, json)
These must hold the GIL almost the whole time.
4. Docs might mention it — but inconsistently
Some libraries, such as NumPy and SciPy, explicitly document that certain operations release the GIL, but many don’t mention it at all.
If the documentation does say it, you can trust it.
If it doesn’t, no conclusion can be drawn.
5. You can measure it
A simple test:
- run the operation in many threads
- watch CPU usage in top or htop
- If you see a single core saturated → the library is likely holding the GIL
- If you see multiple cores fully used → it's releasing the GIL
- If you see partial scaling → it's releasing the GIL only part of the time (very common)
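If you'd rather not stare at top, the same test works with a stopwatch. A minimal sketch, using zlib as the subject: each thread does the same fixed amount of work, so if wall time stays roughly flat as you add threads, the GIL is being released; if it grows linearly with thread count, it isn't.

```python
import threading
import time
import zlib

data = b"x" * (32 * 1024 * 1024)

def work():
    for _ in range(10):
        zlib.compress(data, 1)

def timed(n_threads):
    threads = [threading.Thread(target=work) for _ in range(n_threads)]
    start = time.perf_counter()
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return time.perf_counter() - start

# With a released GIL (and enough cores), 4 threads should take
# about as long as 1; with a held GIL, about 4 times as long.
print(f"1 thread:  {timed(1):.2f}s")
print(f"4 threads: {timed(4):.2f}s")
```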
With Great Power Comes Great Responsibility
Releasing the GIL — or running code that doesn’t use it — doesn’t magically make concurrency “safe.” It just gives you more freedom and more ways to shoot yourself in the foot.
If two threads start modifying the same NumPy array, or the same shared buffer, or the same Python object through a C extension, nothing protects you anymore. You can get classic race conditions, torn writes, and inconsistent data just like in any other multithreaded language.
The GIL wasn’t only a limitation — it was also a guardrail.
Once it’s out of the way, you have to be just as careful as you would be in C++ or Java.
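Even the plain-Python version of this is instructive: the GIL serializes individual bytecodes, never whole read-modify-write sequences, so a shared counter loses updates under ordinary threading. Once native code drops the GIL, you lose even that per-operation guarantee.

```python
import threading

counter = 0

def bump(n):
    global counter
    for _ in range(n):
        value = counter      # read
        counter = value + 1  # write: another thread may have updated
                             # counter in between, and its work is lost

threads = [threading.Thread(target=bump, args=(1_000_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # typically far less than the expected 4_000_000
```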
Conclusion
The GIL has sharp edges and real limitations — but it also hides a lot of complexity and keeps Python usable without turning every piece of code into a concurrency puzzle.
Once you understand the layers behind it, the whole picture becomes much clearer:
- threads aren’t “useless,”
- I/O isn’t the whole story,
- native code can run in parallel,
- native code sometimes can’t,
- and “just use multiprocessing” solves one problem while introducing several others.
In practice, Python can make excellent use of multiple cores — it just depends on what kind of work you’re doing and which libraries you’re using.
I’d love to hear your own experiences:
- times when multiprocessing was total overkill,
- or when a single Python process maxed out all your cores,
- or situations where threads surprised you (for better or worse)
Leave a comment, share a story, or correct me if I got something wrong — the whole point of this post is to make the conversation around the GIL more honest and less magical.
tl;dr
Your workload is:
```
├─ Pure Python computation → multiprocessing
├─ I/O bound (network, disk) → threading or asyncio
├─ Calling C libraries
│  ├─ Don't know if GIL-safe → measure with top/htop or check docs
│  ├─ Releases GIL → threading is fine
│  └─ Requests GIL → multiprocessing or asyncio
└─ Many small tasks → consider overhead cost
```