Adilet Akmatov
I Spent 2 Years Building Bots Before I Understood What x = [] Actually Does

I came to Python from business — 6 years running operations, then 2 years ago switched to building automation tools. Bots, scrapers, Telegram integrations, desktop tools packaged with PyInstaller that clients use on Windows machines in retail stores across Central Asia.

For most of that time, I thought I understood Python pretty well.

Then I hit a bug I couldn't explain.


The Bug That Started All of This

I was building a high-performance web scraper — processing around 40,000 HTML pages per run, concurrently, using Playwright and BeautifulSoup. Production numbers looked fine at first. Then I noticed the process was taking 1.2 seconds per document. Over 40k documents, that becomes 13+ hours. Not acceptable.

I profiled it. The bottleneck was BeautifulSoup parsing — pure Python HTML traversal on every page. So I did what seemed logical: rewrote the hot path as a C extension using libxml2.

Result: 0.09 seconds per document. 13x speedup.

But why exactly did that work? I knew that it worked — Python is slow, C is fast. That's obvious. What I didn't understand was the mechanism. What was Python doing that made it 13x slower? What did the C extension eliminate?

That gap bothered me enough that I cloned the CPython repo and started reading.

Three months later, I understood. And it changed how I think about every Python program I write.

This post is what I found.


Table of Contents

  1. CPython Is Just a C Program
  2. The Life of a Script: From Source to Execution
  3. PyObject — The Real Cost of x = []
  4. Memory Management: Three Tiers
  5. Reference Counting — The Deterministic Heart
  6. Garbage Collector — When Counters Aren't Enough
  7. The GIL — The Most Misunderstood Thing in Python
  8. The Eval Loop — The Heart of the Machine
  9. C Extensions — What Actually Made My Scraper 13x Faster
  10. PyPy, Python 3.13 Free-Threading, and What's Coming
  11. What Changed in My Actual Code

Quick Mental Model

Before we dig in — here's the whole thing on one screen:

Layer            What It Does                                     Lives In
Compiler         Turns .py into bytecode                          Python/compile.c
Virtual Machine  Executes bytecode instruction by instruction     Python/ceval.c
Object Model     Everything is a PyObject with a type + refcount  Objects/
Memory           Reference counting + cyclic GC                   Objects/obmalloc.c
The Constraint   GIL — one thread executes bytecode at a time     Python/ceval_gil.c
.py → [Lexer] → [Parser] → AST → [Compiler] → Bytecode → [Eval Loop] → result
                                                                  ↕
                                                      PyObject / pymalloc / GC

Five layers. That's Python. Now let's open each one.


1. CPython Is Just a C Program

When I first opened the CPython repo, I expected something complicated and exotic. What I found was surprisingly clean C code with good comments and consistent naming.

When you run python script.py, you're launching CPython — an interpreter written in plain C (C11 standard per PEP 7). No C++, no JVM, no Rust. Just C. That decision was deliberate — maximum portability, minimum complexity, maximum contributor accessibility.

The directory structure maps directly to what the runtime does:

CPython repository (github.com/python/cpython):
├── Python/       ← Compiler, Eval Loop, builtins
├── Objects/      ← int, str, list, dict as C structs
├── Modules/      ← C-side of stdlib (json, re, os...)
├── Include/      ← Public C API headers for extensions
├── Lib/          ← Python-side of stdlib
└── Misc/         ← Docs, release notes

My recommendation: open Objects/listobject.c and read it. The actual implementation of Python lists. It's a few thousand lines of very readable C. Once you've done that, Python lists stop being magic and start being a thing you understand. Everything else in the stdlib becomes easier to reason about after that.


2. The Life of a Script

You type python script.py and hit enter. Your source code goes through six stages before anything executes:

Source (.py)
    │
    ▼
[Lexer / Tokenizer]        ← Lib/tokenize.py / Parser/tokenizer.c
    │  Text → tokens: NAME, NUMBER, OP...
    ▼
[Parser]                   ← Parser/parser.c (PEG since Python 3.9)
    │  Tokens → AST (Abstract Syntax Tree)
    ▼
[AST Optimizer]            ← Python/ast_opt.c
    │  Constant folding: TIMEOUT = 60 * 60 * 24 → 86400 at compile time
    ▼
[Compiler]                 ← Python/compile.c
    │  AST → bytecode instructions
    ▼
[Bytecode cache (.pyc)]    ← __pycache__/script.cpython-312.pyc
    │  Cached to disk; skips steps 1–4 on next run
    ▼
[Eval Loop (ceval.c)]      ← Python/ceval.c
    │  Executes bytecode instruction by instruction
    ▼
[Result]

You can inspect what CPython produces from your code:

import dis

def add(a, b):
    return a + b

dis.dis(add)

Output:

  2           0 RESUME                   0

  3           2 LOAD_FAST                0 (a)
              4 LOAD_FAST                1 (b)
              6 BINARY_OP               0 (+)
             10 RETURN_VALUE

Four instructions for a + b, plus the RESUME preamble. Each instruction is a one-byte opcode with a one-byte argument — and the jump from offset 6 to 10 after BINARY_OP is an inline cache entry used by the specializing interpreter. This is the thing Python is actually running — not your source code. When I started using dis regularly on my scrapers and bot code, I started seeing why some things were slow in ways profiling alone didn't reveal.

The .pyc file in __pycache__? Just a serialized code object packed with Python's marshal module. Python checks timestamp and version magic on startup — unchanged file means steps 1–4 are completely skipped.
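
A quick way to see this for yourself — the sketch below compiles a throwaway module, then reads the resulting .pyc back by hand. The file name demo.py and its one-line contents are arbitrary; the 16-byte header layout (magic, flags, mtime, size) has been stable since Python 3.7:

```python
# Peek inside a .pyc: a 16-byte header, then a marshalled code object.
import importlib.util, marshal, pathlib, py_compile, tempfile

src = pathlib.Path(tempfile.mkdtemp()) / "demo.py"
src.write_text("TIMEOUT = 60 * 60 * 24\n")

pyc = py_compile.compile(str(src))              # writes __pycache__/demo.cpython-XXX.pyc
data = pathlib.Path(pyc).read_bytes()

assert data[:4] == importlib.util.MAGIC_NUMBER  # the version magic checked at import
code = marshal.loads(data[16:])                 # the code object the Eval Loop runs
print(code.co_consts)                           # 86400 — constant folding already happened
```

Note that 86400 is sitting in co_consts: the AST optimizer folded 60 * 60 * 24 before the bytecode was ever written to disk.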


3. PyObject — The Real Cost of x = []

This is the thing that changed how I write code more than anything else.

In Python, everything is an object. Under the hood, "object" means one specific thing: a C struct called PyObject.

// Include/object.h
typedef struct _object {
    Py_ssize_t    ob_refcnt;   // reference counter
    PyTypeObject *ob_type;     // pointer to the type (int, str, list...)
} PyObject;

Two fields. 16 bytes on a 64-bit system. Every value you've ever created in Python — every integer, every string, every function, every None — starts with these 16 bytes.

Each concrete type extends this with its own data:

// List (Objects/listobject.c)
typedef struct {
    PyObject_VAR_HEAD       // ob_refcnt + ob_type + ob_size
    PyObject **ob_item;     // array of POINTERS to elements
    Py_ssize_t allocated;   // capacity (not length)
} PyListObject;

Note: ob_item is an array of pointers, not objects. A Python list doesn't contain its elements — it holds PyObject* pointers and doesn't care what they point to. That's how [1, "hello", None, []] works without any type-specialized list variants.
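
You can see the pointer array in the numbers: sys.getsizeof reports the list's own footprint — header plus one 8-byte slot per element — never the elements themselves. A small check, assuming a 64-bit CPython:

```python
import sys

# The list's cost is header + one 8-byte pointer per slot;
# the elements are counted separately.
base = sys.getsizeof([])
thousand = sys.getsizeof([None] * 1000)
per_slot = (thousand - base) / 1000
print(per_slot)   # 8.0 — one pointer per element
```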

Why int Costs 28 Bytes

import sys
# CPython 3.12, 64-bit build
sys.getsizeof(0)      # → 28  (was 24 in Python ≤ 3.11)
sys.getsizeof(1)      # → 28
sys.getsizeof(2**30)  # → 32
sys.getsizeof(2**60)  # → 36

Python integers are bignums — no overflow, ever. Internally they store digits in base 2^30 as an array of uint32_t. That array grows as numbers grow. In C, an int is 4 bytes. In Python, the smallest integer costs 28.

In my scraper, I was creating hundreds of thousands of small integer objects per run as counters and indices. Each one: 28 bytes minimum, an allocation call, a reference count increment. At scale, that adds up.

Small Integer Interning

CPython pre-creates integers from -5 to 256 at startup and reuses them forever:

a = 100
b = 100
a is b  # True — same cached object, zero allocation

a = int("1000")   # built at runtime (literals in one file get merged by the compiler)
b = int("1000")
a is b  # False — two separate allocations

Loop indices, boolean flags, small return codes — all of these hit the interned pool. Without this optimization, tight loops would hammer the allocator on numbers like 0, 1, True, False.
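
You can probe the pool's boundaries empirically. int(str(i)) forces a runtime construction, so `is` reveals whether the result came out of the cache — a CPython implementation detail, not a language guarantee:

```python
# Find the interned small-int range by checking object identity.
# int(str(i)) builds the value fresh at runtime; `is` exposes the cache.
interned = [i for i in range(-300, 1000) if int(str(i)) is i]
print(min(interned), max(interned))   # -5 256
```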

⚠️ Never use is for value comparison. a is b checks memory address. Past 256, "equal" integers are different objects. a == b is always correct.

String Interning

CPython silently interns compile-time strings that look like identifiers (letters, digits, underscores). Strings with spaces or punctuation are not interned. One subtlety: within a single file the compiler also merges equal string constants, so to see the difference you need strings built at runtime:

a = "status"
b = "status"
a is b  # True — identifier-like, interned

a = "".join(["status", ": ok"])   # constructed at runtime — not interned
b = "".join(["status", ": ok"])
a is b  # False — two separate objects

import sys
a = sys.intern("".join(["status", ": ok"]))
b = sys.intern("".join(["status", ": ok"]))
a is b  # True — explicitly interned

I use sys.intern() in systems that do millions of repeated dictionary lookups with the same string keys — like an ORM layer processing repeated field names or a JSON cache with known keys. String comparison becomes O(1) pointer check instead of O(n) character scan.


4. Memory Management: Three Tiers

When I write x = [], Python doesn't just call malloc. It runs through a three-tier allocation system:

┌─────────────────────────────────────────────────────┐
│  Tier 3: pymalloc  (Objects/obmalloc.c)             │
│  Objects ≤ 512 bytes — the vast majority            │
│  Custom arena/pool system, no syscall on most hits  │
├─────────────────────────────────────────────────────┤
│  Tier 2: PyMem API  (PyMem_Malloc / PyMem_Free)     │
│  Thin wrapper, used for debug hooks                 │
├─────────────────────────────────────────────────────┤
│  Tier 1: OS allocator  (malloc / free from libc)    │
│  Objects > 512 bytes and internal structures        │
└─────────────────────────────────────────────────────┘

Nearly everything lands in Tier 3. Here's how it works.

pymalloc: Arenas, Pools, Blocks

Arena (256 KB) ← one mmap() call to the OS
├── Pool [size_class=8]    ← all blocks 8 bytes
│   ├── [block][block][block]...
├── Pool [size_class=16]
...
└── Pool [size_class=512]
Term    Size                What It Is
Block   8–512 bytes (×8)    Holds one object
Pool    4 KB (OS page)      Blocks of the same size class
Arena   256 KB              Up to 64 pools

These are the classic sizes, and they're version-dependent — the obmalloc rework around Python 3.10 enlarged arenas to 1 MiB. Don't trust blog posts, including this one. Check your version: grep ARENA_SIZE Objects/obmalloc.c

When you need a 24-byte block, pymalloc checks if there's a pool with that size class and a free slot. If yes — take it, done, zero system calls. That's why creating thousands of small objects in a loop is fast. The OS isn't involved on most allocations.

Why Python "Holds Onto" Memory

This one burned me on a bot that ran 24/7. After processing large batches, the process would stay at peak memory indefinitely — even after I deleted everything.

The rule:

  • Freed block → back to its pool
  • Empty pool → back to its arena
  • Empty arena only → released to OS

If you allocate 400 MB of small objects, delete them all, but one object in one pool in one arena is still alive — that 256 KB arena doesn't go back to the OS. From top's perspective: process still uses 400 MB.

This is not a bug. It's the arena pool holding memory for reuse. Knowing this means you stop chasing phantom memory leaks that aren't actually leaks.


5. Reference Counting — The Deterministic Heart

Every PyObject has ob_refcnt. When it drops to zero, the object is destroyed. Immediately. Not "next GC cycle" — immediately.

// Include/object.h (simplified)
#define Py_INCREF(op)  ((PyObject*)(op))->ob_refcnt++

#define Py_DECREF(op)                                \
    if (--((PyObject*)(op))->ob_refcnt == 0)         \
        _Py_Dealloc((PyObject*)(op))

These two macros are everywhere in CPython. Assign a variable → INCREF. Pass to a function → INCREF. Exit a scope → DECREF. It's constant background churn — the price of deterministic memory management.

Watching It Live

import sys

a = []
print(sys.getrefcount(a))  # 2 — variable 'a' + getrefcount's arg

b = a
print(sys.getrefcount(a))  # 3 — + 'b'

c = [a, a, a]
print(sys.getrefcount(a))  # 6 — + three list slots

del b
print(sys.getrefcount(a))  # 5

del c
print(sys.getrefcount(a))  # 2 — all three references gone at once

del a
# → 0 → _Py_Dealloc() → block back to pymalloc pool

getrefcount always reads one too high — the function call itself holds a temporary reference. Tripped me up for longer than I'd like to admit.

The Cascade Effect

When ob_refcnt hits zero, the destructor calls DECREF on everything the object holds. If a list has 10,000 items and you delete the list, all 10,000 items get DECREF-ed in one chain reaction. If any of those items also hit zero, they cascade too.

For my bots and scrapers, this actually matters. A del response_cache at the end of a batch run triggers an immediate cascade deallocation — predictable, synchronous, no surprises. This is genuinely useful when you're managing memory across long-running processes.

Deterministic Cleanup in CPython

In Java or Go, an abandoned file object might stay open until the garbage collector gets around to it. In CPython, reference counting closes it the moment the last reference dies:

def read_data():
    f = open("data.txt")
    return f.read()
    # On return, f's refcount drops to 0 → file flushed and closed right here.
    # Not "eventually".

A with block is still the right habit: it guarantees closure via __exit__ on every Python implementation — including PyPy, where refcount-driven cleanup doesn't exist. Refcounting is why CPython forgives you when you forget; with is what makes the code correct everywhere.

The __del__ Trap

registry = []

class Zombie:
    def __del__(self):
        registry.append(self)  # refcount > 0 again — resurrection!

z = Zombie()
del z         # __del__ fires — but object survives into registry
registry.clear()  # dies for real — __del__ does NOT fire again

CPython calls __del__ exactly once. If the object survives, fine. When it finally dies, the destructor won't run again.

Don't use __del__ for cleanup. It's not reliable for resources. Use context managers.
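
What that looks like in practice — a sketch with a hypothetical Connection resource, where contextlib guarantees the cleanup runs exactly once, at block exit, with no __del__ involved:

```python
import contextlib

class Connection:                  # hypothetical resource standing in for a real one
    def __init__(self):
        self.closed = False
    def close(self):
        self.closed = True

@contextlib.contextmanager
def connect():
    conn = Connection()
    try:
        yield conn
    finally:
        conn.close()               # runs exactly once, at block exit — guaranteed

with connect() as c:
    assert not c.closed            # open inside the block
assert c.closed                    # closed the instant the block ended
```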

The Fatal Flaw: Cycles

a = {}
a["self"] = a   # a → a → a...
del a           # refcount drops to 1, not 0. Nobody can reach it. Won't die.

Reference counting is completely blind to cycles. This is why the GC exists.


6. Garbage Collector — When Counters Aren't Enough

Python's GC (Modules/gcmodule.c) has one job: find cycles that reference counting missed and destroy them.

It doesn't track everything. Only container objects that can hold references — list, dict, set, custom class instances. A str is never GC-tracked because strings can't point to other objects.

Three Generations

Generation 0  (threshold=700)   ← new objects
Generation 1  (threshold=10)    ← survived 1 collection
Generation 2  (threshold=10)    ← long-lived

Gen 0 runs frequently. Gen 2 rarely. The logic: most objects die young. The ones that survive long enough are probably going to live forever.

Python 3.14: GC became incremental — work is split into small chunks interleaved with execution. No more single stop-the-world pause. Big deal for latency-sensitive services.

How It Finds Cycles

1. Copy ob_refcnt → gc_refs (scratch field)
2. For each tracked object: walk its references,
   subtract 1 from gc_refs of each pointed-to object
3. After the walk:
   gc_refs > 0  → reachable from outside → ALIVE
   gc_refs == 0 → only referenced within cycle → DEAD → destroy

Elegant. No global reachability scan — just local reference subtraction.
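
You can watch the algorithm work. In this sketch a weakref observes the object without keeping it alive: reference counting alone can't free the cycle, but one collect() pass does:

```python
import gc, weakref

class Node:
    pass

n = Node()
n.self = n                 # a cycle — ob_refcnt can never reach zero on its own
probe = weakref.ref(n)     # observes the object without keeping it alive

del n                      # last outside reference gone...
assert probe() is not None # ...but the cycle keeps the object alive

gc.collect()               # ref-subtraction finds the unreachable cycle
assert probe() is None     # now it's really gone
```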

In Production

import gc

# ⚠️ Only disable if you're certain there are no cycles.
# In my bots: loggers, closures, Playwright browser contexts — all have cycles.
# Disabling GC with cycles = silent memory growth until OOM.
gc.disable()

gc.collect(0)  # Gen 0 only
gc.collect()   # Everything

print(gc.get_count())      # (480, 3, 1)
print(gc.get_threshold())  # (700, 10, 10)
gc.set_debug(gc.DEBUG_LEAK)

In my high-throughput scrapers, I've used scheduled GC to avoid surprise pauses mid-request:

import gc, threading, time, logging
logger = logging.getLogger(__name__)

def gc_worker():
    while True:
        time.sleep(30)
        n = gc.collect()
        logger.debug(f"GC collected {n} objects")

threading.Thread(target=gc_worker, daemon=True).start()

A predictable pause you control beats a random pause you don't. But: measure first. Most codebases don't need this.

weakref — Better Than Fighting Cycles

The right move is often to not create cycles in the first place. In any tree structure where nodes need a back-reference to their parent, use weakref:

import weakref

class NodeSafe:
    def __init__(self, value):
        self.value = value
        self._parent = None

    @property
    def parent(self):
        return self._parent() if self._parent else None

    @parent.setter
    def parent(self, node):
        self._parent = weakref.ref(node) if node else None

root = NodeSafe("root")
child = NodeSafe("child")
child.parent = root   # doesn't increment root's refcount

del root
print(child.parent)  # → None immediately. No GC needed.

I use this pattern in any linked/tree data structure. Also useful for caches where you want entries to disappear automatically when nothing else holds the value.
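
For the cache case, the stdlib already packages the pattern as weakref.WeakValueDictionary — entries drop out as soon as the last strong reference to the value dies. A sketch with a hypothetical Page class (plain str or int values won't work; they don't support weak references):

```python
import weakref

class Page:                        # hypothetical cache value — must be weak-referenceable
    def __init__(self, html):
        self.html = html

cache = weakref.WeakValueDictionary()
p = Page("<html>...</html>")
cache["home"] = p

assert "home" in cache
del p                              # last strong reference dies...
assert "home" not in cache         # ...entry vanished at once (CPython refcounting)
```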


7. The GIL — The Most Misunderstood Thing in Python

I spent years being confused by the GIL. Let me save you the confusion.

The GIL (Global Interpreter Lock) is a mutex. It enforces one invariant: only one thread executes Python bytecode at a given moment.

// Python/ceval_gil.c
static _Py_atomic_int eval_breaker;   // "yield now" flag
static _Py_atomic_int gil_locked;     // 0 = free, 1 = held

// sys.getswitchinterval() = 5ms by default
// This is a CHECK INTERVAL — not a guaranteed switch time.
// Real switching only happens when a waiting thread requests
// the GIL within that window.

Why It Exists

Reference counting is not thread-safe. Two threads touching ob_refcnt concurrently is a data race. Without synchronization, you'd get _Py_Dealloc firing twice on the same object — use-after-free, crash. The GIL is the "one big lock" solution to this. For 1991 when CPython was first written, it was a reasonable engineering choice.

The trade-off: simplicity and safety in exchange for CPU parallelism.

What It Doesn't Block

This is the part most people miss.

The GIL is released during I/O. Every network call, every disk read, every time.sleep — the GIL drops for the duration. While one thread waits on the network, other threads run bytecode. Real concurrency.

In my Telegram bots, I run hundreds of concurrent polling connections. Threading works great for this because 99% of the time each thread is blocked waiting on an API response — GIL is free for whoever needs it.

import threading, urllib.request

def fetch(url):
    urllib.request.urlopen(url)  # GIL released during entire network wait

threads = [threading.Thread(target=fetch, args=(url,)) for url in urls]
for t in threads: t.start()
for t in threads: t.join()
# Actual concurrent I/O across all threads

C extensions can also explicitly drop the GIL during computation. That's how numpy, pandas, and Pillow achieve real multi-core performance — Python layer holds the GIL, C computation layer drops it.
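
You don't need a C extension to watch the release happen — time.sleep is itself a C-level call that drops the GIL for its duration, so four 0.2-second sleeps on four threads finish in roughly 0.2 seconds of wall time, not 0.8:

```python
import threading, time

def wait():
    time.sleep(0.2)    # C-level sleep — the GIL is dropped for the duration

start = time.perf_counter()
threads = [threading.Thread(target=wait) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
elapsed = time.perf_counter() - start
print(f"{elapsed:.2f}s")   # ~0.2s — the four sleeps overlapped
```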

❗ The Rule to Remember

The GIL doesn't make single-threaded Python slow — it means CPython can't execute bytecode on more than one core at a time.

CPU-bound task  →  multiprocessing  (each process = its own GIL)
I/O-bound task  →  threading/asyncio (GIL releases on every blocking call)
C extension     →  threading + Py_BEGIN_ALLOW_THREADS
from concurrent.futures import ProcessPoolExecutor

def heavy(n):
    return sum(i * i for i in range(n))

with ProcessPoolExecutor(max_workers=4) as exe:
    results = list(exe.map(heavy, [1_000_000] * 8))
# 4 real cores, no GIL contention

In my automation work: API calls, web scraping, Telegram polling — threading works perfectly. Parsing HTML, running ML inference, crunching numbers — multiprocessing or a C extension.


8. The Eval Loop — The Heart of the Machine

Python/ceval.c. This is where every Python instruction you've ever executed actually ran.

A few thousand lines of C at its core (more in recent versions, where the opcode cases are generated from Python/bytecodes.c). One for(;;) loop. One giant dispatch over opcodes — a switch in the simplest build, computed gotos in optimized ones. That's it.

// Simplified — the real version is much more complex
PyObject *
_PyEval_EvalFrameDefault(PyThreadState *tstate, PyFrameObject *f, int exc)
{
    for (;;) {
        if (_Py_atomic_load_relaxed(&eval_breaker)) {
            // Handle signals, check whether to yield the GIL, etc.
        }

        opcode = NEXTOPARG();

        switch (opcode) {
            case LOAD_FAST: {
                PyObject *value = GETLOCAL(oparg);
                Py_INCREF(value);
                PUSH(value);
                break;
            }
            case BINARY_OP: {
                PyObject *right = POP();
                PyObject *left  = POP();
                // Reality: dispatches through _PyNumber_BinaryOp()
                // → left->ob_type->tp_as_number->nb_add
                // → PyNumber_Add → slot lookup
                // Simplified here for readability.
                PyObject *res = binary_op(left, right, oparg);
                Py_DECREF(left); Py_DECREF(right);
                PUSH(res);
                break;
            }
            case RETURN_VALUE: {
                retval = POP();
                goto return_or_yield;
            }
        }
    }
}

Python is a stack machine. Instructions push values onto a virtual stack and pop them to perform operations:

[LOAD_FAST a]  → Stack: [42]
[LOAD_FAST b]  → Stack: [42, 8]
[BINARY_OP +]  → Stack: [50]
[RETURN_VALUE] → returns 50, frame destroyed

Frame Objects

Each function call creates a PyFrameObject — locals, current bytecode position, pointer to the calling frame. The f_back chain is the call stack you see in tracebacks:

import sys

def inner():
    frame = sys._getframe()
    print(frame.f_code.co_name)         # 'inner'
    print(frame.f_back.f_code.co_name)  # 'outer'

def outer():
    inner()
outer()

Since Python 3.11, lightweight frames live in a contiguous per-thread stack, and a full heap-allocated PyFrameObject is only materialized when something — a traceback, sys._getframe — actually asks for it. Function calls and deep recursion got noticeably faster from this alone.

The Adaptive Interpreter (Python 3.11+)

This is genuinely clever. The Eval Loop now watches what types your code uses. After seeing the same instruction with the same types enough times, it rewrites that instruction in place with a specialized version:

BINARY_OP   →  BINARY_OP_ADD_INT      (both operands are int — skip type checks)
LOAD_ATTR   →  LOAD_ATTR_MODULE       (attribute is from a module — direct pointer)
CALL        →  CALL_PY_EXACT_ARGS     (argument count matches exactly)

Python rewrites its own bytecode at runtime based on what it observes your code doing. A specialized opcode skips the entire type dispatch chain and goes straight to the optimized C path.

Result: +25% throughput versus Python 3.10 with no changes to your code. This is also the foundation the experimental JIT in Python 3.13+ builds on.


9. C Extensions — What Actually Made My Scraper 13x Faster

Back to the original question.

Any .so on Linux/macOS or .pyd on Windows is a dynamic library built against the Python C API. This is the mechanism behind numpy, lxml, ujson, orjson, Pillow — Python shell, C engine.

Here's what my scraper change looked like conceptually:

# Before: pure Python, 1.2s per document
def extract_links(html: str) -> list[str]:
    from bs4 import BeautifulSoup
    soup = BeautifulSoup(html, "html.parser")
    return [a["href"] for a in soup.find_all("a", href=True)]

What Python was doing under the hood for each document:

  • Create hundreds or thousands of PyObject instances for DOM nodes
  • Run refcount increments/decrements on every attribute access
  • All HTML traversal going through the Eval Loop opcode by opcode
  • GIL held for the entire duration — no other thread could run

The C extension path:

static PyObject *
fast_extract_links(PyObject *self, PyObject *args)
{
    const char *html;
    PyObject *result_list = NULL;

    if (!PyArg_ParseTuple(args, "s", &html))
        return NULL;

    Py_BEGIN_ALLOW_THREADS   // ← GIL released
    // ... libxml2 parsing in pure C — no PyObjects, no refcounts ...
    Py_END_ALLOW_THREADS     // ← GIL re-acquired

    // Build result_list from the C-side results here — creating
    // PyObjects requires holding the GIL.
    return result_list;
}

What changed: the HTML traversal went from "thousands of opcode dispatches + PyObject allocations" to "C function calls with direct memory access." No Eval Loop overhead. No per-step refcount churn. GIL released so other threads ran in parallel. 13x.

Building a Minimal C Extension

// mymodule.c
#define PY_SSIZE_T_CLEAN
#include <Python.h>

static PyObject *
double_it(PyObject *self, PyObject *args)
{
    int n;
    if (!PyArg_ParseTuple(args, "i", &n))
        return NULL;
    return PyLong_FromLong(n * 2);
}

static PyMethodDef methods[] = {
    {"double_it", double_it, METH_VARARGS, "Double an integer"},
    {NULL, NULL, 0, NULL}
};

static struct PyModuleDef module = {
    PyModuleDef_HEAD_INIT, "mymodule", NULL, -1, methods
};

PyMODINIT_FUNC PyInit_mymodule(void) {
    return PyModule_Create(&module);
}
python setup.py build_ext --inplace

import mymodule
mymodule.double_it(21)  # → 42

Reference Management — Where Things Go Wrong

The most common bug in C extensions: getting refcounts wrong. Leak one INCREF → memory leak. Miss one DECREF → use-after-free. The standard cleanup pattern:

static PyObject *
safe_build_list(PyObject *self, PyObject *args)
{
    PyObject *list = NULL, *item = NULL;

    list = PyList_New(3);
    if (!list) goto error;

    for (int i = 0; i < 3; i++) {
        item = PyLong_FromLong(i);
        if (!item) goto error;

        PyList_SET_ITEM(list, i, item);  // list steals the reference
        item = NULL;                      // we don't own it anymore
    }

    return list;

error:
    Py_XDECREF(item);  // XDECREF is NULL-safe
    Py_XDECREF(list);
    return NULL;
}

The Modern Toolkit

Nobody writes raw C extensions unless they have a specific reason. The usual choices:

Tool       When
ctypes     Call an existing .dll/.so without compiling anything
cffi       A more robust ctypes for complex C APIs
Cython     Python-like syntax compiled to C — great for hot loops
pybind11   Wrapping existing C++ code — the industry standard
mypyc      Type-annotated Python → C extension, zero new syntax

10. PyPy, Python 3.13 Free-Threading, and What's Coming

PyPy

PyPy is an alternative Python interpreter with a tracing JIT compiler. It watches what your code does and compiles hot paths to machine code. Loops that take seconds in CPython: milliseconds in PyPy, no source changes needed.

The trade-off: PyPy uses tracing GC instead of reference counting. No deterministic __del__. Objects die when the GC decides, not when the last reference disappears.

pypy3 script.py  # often faster with zero changes

Use it for: pure Python algorithms, number crunching without numpy, long-running services with well-defined hot paths.

Avoid it when: you depend heavily on C extensions like numpy, pandas, Playwright. They're built for the CPython ABI and range from slower to broken under PyPy.

CPython 3.13: Free-Threading (PEP 703)

Python 3.13 shipped an experimental build with the GIL removed:

pyenv install 3.13t  # 't' = free-threaded
python3.13t -c "import sys; print(sys._is_gil_enabled())"  # False

Removing the GIL required solving the thread-safety problem differently:

  • Biased Reference Counting — fast per-thread counters for thread-owned objects; slower shared counters for cross-thread objects
  • Immortal Objects (PEP 683) — None, True, False get a fixed sentinel refcount and are never modified, eliminating all races on the most common objects
  • Per-object locks on dict, list, and other mutable containers

Currently opt-in — most C extensions aren't compatible yet. But this is the most significant architectural change to CPython in its history.


Architecture Map

Your code (.py)
      │
      ▼
 [PEG Parser]  →  AST  →  [Compiler]  →  Bytecode
                                               │
                                               ▼
                                        [Eval Loop]
                                         ceval.c
                                               │
                              ┌────────────────┼────────────────┐
                              │                │                │
                              ▼                ▼                ▼
                        [PyObject]      [Memory Mgmt]    [C Extensions]
                        ob_refcnt       pymalloc           .so / .pyd
                        ob_type         arenas/pools       Python C API
                              │                │
                              ▼                ▼
                        [Ref Counting]   [GC (cyclic)]
                        Py_INCREF/       cycle detection
                        Py_DECREF        3 generations

11. What Changed in My Actual Code

This is the part nobody writes. Not "here are optimizations" — but what specifically changed in how I write real production code after understanding CPython internals.

1. I use __slots__ on any class I instantiate at volume.
My Playwright bot creates thousands of event objects and DOM node wrappers per session. Before: ~280 bytes per object. After __slots__: ~56 bytes. On a 24-hour run, that difference is significant.
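
The saving is easy to measure. A regular instance pays for itself plus its __dict__; a slotted one stores attributes inline. The classes below are hypothetical stand-ins, and exact byte counts vary by Python version:

```python
import sys

class Plain:                         # hypothetical event wrapper
    def __init__(self, x, y):
        self.x, self.y = x, y

class Slotted:
    __slots__ = ("x", "y")           # fixed attribute layout, no per-instance dict
    def __init__(self, x, y):
        self.x, self.y = x, y

p = Plain(1, 2)
s = Slotted(1, 2)

plain_cost = sys.getsizeof(p) + sys.getsizeof(p.__dict__)  # instance + its dict
slot_cost = sys.getsizeof(s)                               # attributes stored inline
print(plain_cost, slot_cost)         # slotted is substantially smaller
```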

2. I bind module functions before tight loops.
Bots and scrapers have hot loops. json.loads in a loop → _loads = json.loads above the loop. This is LOAD_FAST vs LOAD_GLOBAL on every iteration. In a loop over 100k items, it's measurable.
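
A sketch of the pattern with timeit. The default-argument trick binds json.loads once, at definition time, turning the per-iteration global-plus-attribute lookup into a single LOAD_FAST. Timings are machine-dependent, so treat the comparison as illustrative:

```python
import json, timeit

payload = '{"status": "ok", "items": [1, 2, 3]}'

def global_lookup():
    for _ in range(10_000):
        json.loads(payload)       # LOAD_GLOBAL json → LOAD_ATTR loads, every iteration

def local_bound(_loads=json.loads):
    for _ in range(10_000):
        _loads(payload)           # LOAD_FAST — the lookup happened once, at def time

slow = timeit.timeit(global_lookup, number=5)
fast = timeit.timeit(local_bound, number=5)
print(f"{slow:.3f}s vs {fast:.3f}s")   # the bound version is usually a bit faster
```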

3. I stopped using threading for CPU-bound work.
Sounds obvious in hindsight. But I had a pipeline where I was running threading.Thread for parallel HTML parsing, wondering why I wasn't getting faster. Now I know: parsing is CPU-bound, GIL doesn't release, threads compete instead of parallelize. Switched to ProcessPoolExecutor, got the expected speedup.

4. I reach for C extensions or Cython for genuine hot paths.
Not prematurely — only after profiling. But knowing that the cost difference is "thousands of Eval Loop opcode dispatches" vs "direct C function calls" gives me a clear mental model for when the overhead matters enough to pay the engineering cost.

5. I use weakref in data structures with back-references.
My bots have conversation state graphs where nodes reference parent nodes. Without weakref, the GC has to find and clean up the cycles. With weakref, parent references are free — no cycle, no GC involvement, immediate cleanup.

6. When a memory leak appears, I know where to look first.
Before: guessing, adding del everywhere, restarting services. Now: is it a cycle? Is it a module-level cache growing unbounded? Is pymalloc holding freed arenas? Is a C extension leaking a refcount? I have a mental model that makes the diagnostic systematic instead of random.


Conclusion

Python isn't slow because it's poorly designed. It's slow because it does an enormous amount of work on your behalf.

Every x = [] is an allocator call, a refcount, a type pointer, a GC registration. Every a + b is a type dispatch chain. Every function call is a new frame object. Python hides this so you can think about the problem. The hiding is the point.

Understanding it doesn't mean you write lower-level code. It means you know when the abstraction is costing you something and when it isn't. You stop cargo-culting optimizations and start making decisions based on what's actually happening in the runtime.

That's the difference.


Next steps if you want to go further:

  • Clone CPython: git clone https://github.com/python/cpython
  • Read Objects/listobject.c — list implementation, including list.sort() (timsort)
  • Read Objects/longobject.c — why integers work the way they do
  • Run dis.dis() on your hot functions and look at what Python actually executes
  • python -m cProfile -s cumulative your_script.py before optimizing anything

If you're preparing for senior/lead interviews, Part 2 will cover how to talk about CPython internals in a technical interview — what's worth knowing, what questions to expect, how to frame deep knowledge without sounding like you're reciting a textbook.


I write about Python automation, bots, and infrastructure — came from 6 years in business, now 2 years deep in the engineering side.

#python #cpython #performance #programming #automation #webdev
