https://www.youtube.com/watch?v=JXrPfI08euE
Two programs. The same loop — sum every integer from 0 to 100 million. One in Python, one in C. Same algorithm, same answer.
C finishes in 0.82 seconds. Python takes 92 seconds. That's 112× slower.
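The Python half of that benchmark is just this shape of loop. A minimal sketch (with a smaller bound so it finishes in a moment; the 0.82 s / 92 s numbers are the video's, not reproduced here):

```python
import time

N = 10_000_000  # the video uses 100_000_000

start = time.perf_counter()
s = 0
for i in range(N):
    s += i  # this innocent line is where all the time goes
elapsed = time.perf_counter() - start

print(f"sum = {s}  ({elapsed:.2f}s for {N:,} iterations)")
```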
Everyone who's ever written Python knows it's "slow." Very few know why. The answer isn't the GIL. The answer isn't a missing compiler — Python has one. The answer is what happens on every single iteration.
## What `a + b` Actually Costs
In C, `a + b` compiles to a single machine instruction. `ADD`. Two registers. One clock cycle. Done.
In Python, that same line triggers a cascade of work on every iteration. Let's walk through it.
Step 1: Dispatch. Python compiles `a + b` into a bytecode instruction called `BINARY_OP`. The interpreter — a big C loop inside CPython — fetches the instruction, decodes it, and jumps to the handler. Every iteration pays this cost.
Step 2: Figure out what we're adding. Integers? Floats? Strings? Lists? The interpreter has to look. It follows pointers to each operand's type descriptor. Since Python 3.11 this hot path is specialized — but the machinery is still there.
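The reason the interpreter has to look: one opcode serves every type that defines `+`, so the operands' types decide at runtime what actually happens:

```python
# One BINARY_OP opcode, four different behaviors,
# selected at runtime by the operands' types:
print(1 + 2)          # int addition       -> 3
print(1.5 + 2.5)      # float addition     -> 4.0
print("py" + "thon")  # str concatenation  -> 'python'
print([1] + [2])      # list concatenation -> [1, 2]
```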
Step 3: The actual addition. Here's where it gets expensive.
A Python integer is not four bytes of data. It's a full object on the heap. On a typical 64-bit CPython build:
| Field | Size | What it holds |
|---|---|---|
| `ob_refcnt` | 8 bytes | Reference count |
| `ob_type` | 8 bytes | Pointer to the `int` type |
| `ob_size` / `lv_tag` | 8 bytes | Size and sign |
| `ob_digit[]` | 4+ bytes | The actual number, in 30-bit digits |
| **Total** | ~28 bytes | For a single small integer |
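You can check the ~28-byte figure directly with `sys.getsizeof` (the exact numbers are CPython implementation details and vary slightly across versions and platforms):

```python
import sys

print(sys.getsizeof(42))      # ~28 bytes on a typical 64-bit CPython
print(sys.getsizeof(2**100))  # larger ints grow by 30-bit digits
```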
Every integer in Python looks like this. The number 42. The number 0. All of them. Heap objects with headers.
So to add two of them, Python has to unwrap both — reach past the headers for the digits — add the digits, then allocate a brand new object on the heap to hold the result. Malloc. Zero out memory. Write the header. Write the digits. Return a pointer.
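That "brand new object" is observable from Python. Outside CPython's small-integer cache (roughly -5 to 256), every addition hands back a freshly allocated object (a CPython implementation detail, but a telling one):

```python
a = 1000
b = a + 0       # same value, but the addition allocated a new object
print(a == b)   # True  -> equal values
print(a is b)   # False -> different heap objects (on CPython)
```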
Step 4: Refcounts. Python increments the reference count on the new object and decrements the counts on the values it no longer needs. More memory writes.
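The counters themselves are visible through `sys.getrefcount` (the reported number includes the temporary reference created by the call itself):

```python
import sys

x = object()
print(sys.getrefcount(x))  # typically 2: x, plus the call's argument
y = x                      # binding another name increments the count
print(sys.getrefcount(x))  # typically 3
```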
That's one iteration of your Python loop: dispatch, type check, two header lookups, heap allocation, refcount bookkeeping. For what C does in one instruction.
Now multiply by 100 million.
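One way to feel that per-iteration tax without writing any C: compare the explicit for-loop with the built-in `sum`, whose loop runs inside the interpreter's own C code. A rough illustration (timings vary by machine; `N` is scaled down from the video's 100 million):

```python
import timeit

N = 1_000_000  # scaled down so this runs quickly

def loop_sum():
    """Pure-Python loop: pays dispatch and boxing on every iteration."""
    s = 0
    for i in range(N):
        s += i
    return s

t_loop = timeit.timeit(loop_sum, number=5)
t_builtin = timeit.timeit(lambda: sum(range(N)), number=5)

print(f"for-loop: {t_loop:.3f}s   built-in sum: {t_builtin:.3f}s")
# The built-in is usually several times faster: same additions,
# far fewer interpreter round-trips.
```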
## The Assembly Showdown
Here's what C's `-O2` optimizer produces for the inner loop (AArch64):

```asm
loop:
    add  x19, x19, x8   ; s += i
    add  x8, x8, #1     ; i++
    cmp  x8, x9
    b.lt loop
```
Four instructions. Registers only. No memory allocation. No function calls.
And CPython's equivalent? It's in a file called `ceval.c`. The handler for a single `BINARY_OP` on two integers walks through: opcode fetch, branch to handler, pop two stack values, dispatch to the type's `nb_add` slot, type checks, unpack digits, call `long_add`, allocate a new `PyLongObject`, zero its memory, write header, write digits, return pointer, push onto stack, refcount bookkeeping, jump back.
Dozens of C function calls per Python iteration. Hundreds of instructions. For what C does in one.
The C compiler can also vectorize — on the right shape of loop it uses SIMD to add multiple numbers per instruction. Python's interpreter can't see the loop as a loop. It sees opcodes, and runs them one at a time.
## Is Python Stuck Here?
No.
```python
import numpy as np

s = np.arange(100_000_000).sum()
```
0.1 seconds. Faster than our C version.
Because NumPy isn't Python. NumPy is a thin Python wrapper around a C library. The array is a contiguous block of raw bytes: packed machine integers (int64 on most 64-bit builds), one after the other. And `.sum()` is compiled C code, often vectorized, hitting your CPU's add instructions directly.
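You can inspect that layout yourself (requires NumPy; note the default integer dtype is platform dependent, e.g. int64 on most 64-bit Linux/macOS builds, int32 on Windows):

```python
import numpy as np

a = np.arange(10)
print(a.dtype)                  # platform default, e.g. int64
print(a.itemsize)               # bytes per element: 8 for int64, not ~28
print(a.flags['C_CONTIGUOUS'])  # True: one packed block of raw bytes
print(a.nbytes)                 # itemsize * 10, no per-element headers
```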
Same answer. Same Python-looking API. But the loop runs in C.
That's the trick every fast Python library uses. NumPy, Pandas, PyTorch, scikit-learn — they aren't magic. They're C, wearing a Python mask.
Python has other escape hatches too: PyPy uses a tracing JIT, Cython compiles Python-like code to native, and Python 3.13 ships an experimental JIT if you build it with `--enable-experimental-jit`.
## The Real Lesson
When Python is slow, it's not because Python is broken. It's because a Python for-loop is doing something C doesn't do — pushing every number through an interpreter that treats every integer as a heap object.
Once you know that, you know when to reach for NumPy and when a plain for-loop is fine.
And there's a fascinating story inside NumPy — how it keeps that loop running at C speed without losing the Python feel. But that's for another video.