https://www.youtube.com/watch?v=JXrPfI08euE
Two programs. The same loop — sum every integer from 0 to 100 million. One in Python, one in C. Same algorithm, same answer.
C finishes in 0.82 seconds. Python takes 92 seconds. That's 112× slower.
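The Python half of that benchmark is just this shape of loop. A minimal sketch (with a smaller bound so it finishes in a moment; the 0.82 s / 92 s numbers are the video's, not reproduced here):

```python
import time

N = 10_000_000  # the video uses 100_000_000

start = time.perf_counter()
s = 0
for i in range(N):
    s += i  # this innocent line is where all the time goes
elapsed = time.perf_counter() - start

print(f"sum = {s}  ({elapsed:.2f}s for {N:,} iterations)")
```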
Everyone who's ever written Python knows it's "slow." Very few know why. The answer isn't the GIL. The answer isn't a missing compiler — Python has one. The answer is what happens on every single iteration.
## What `a + b` Actually Costs
In C, `a + b` compiles to a single machine instruction. `ADD`. Two registers. One clock cycle. Done.
In Python, that same line triggers a cascade of work on every iteration. Let's walk through it.
Step 1: Dispatch. Python compiles `a + b` into a bytecode instruction called `BINARY_OP`. The interpreter — a big C loop inside CPython — fetches the instruction, decodes it, and jumps to the handler. Every iteration pays this cost.
Step 2: Figure out what we're adding. Integers? Floats? Strings? Lists? The interpreter has to look. It follows pointers to each operand's type descriptor. Since Python 3.11 this hot path is specialized — but the machinery is still there.
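The reason the interpreter has to look: one opcode serves every type that defines `+`, so the operands' types decide at runtime what actually happens:

```python
# One BINARY_OP opcode, four different behaviors,
# selected at runtime by the operands' types:
print(1 + 2)          # int addition       -> 3
print(1.5 + 2.5)      # float addition     -> 4.0
print("py" + "thon")  # str concatenation  -> 'python'
print([1] + [2])      # list concatenation -> [1, 2]
```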
Step 3: The actual addition. Here's where it gets expensive.
A Python integer is not four bytes of data. It's a full object on the heap. On a typical 64-bit CPython build:
| Field | Size | What it holds |
|---|---|---|
| `ob_refcnt` | 8 bytes | Reference count |
| `ob_type` | 8 bytes | Pointer to the `int` type |
| `ob_size` / `lv_tag` | 8 bytes | Size and sign |
| `ob_digit[]` | 4+ bytes | The actual number, in 30-bit digits |
| **Total** | ~28 bytes | For a single small integer |
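You can check the ~28-byte figure directly with `sys.getsizeof` (the exact numbers are CPython implementation details and vary slightly across versions and platforms):

```python
import sys

print(sys.getsizeof(42))      # ~28 bytes on a typical 64-bit CPython
print(sys.getsizeof(2**100))  # larger ints grow by 30-bit digits
```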
Every integer in Python looks like this. The number 42. The number 0. All of them. Heap objects with headers.
So to add two of them, Python has to unwrap both — reach past the headers for the digits — add the digits, then allocate a brand new object on the heap to hold the result. Malloc. Zero out memory. Write the header. Write the digits. Return a pointer.
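That "brand new object" is observable from Python. Outside CPython's small-integer cache (roughly -5 to 256), every addition hands back a freshly allocated object (a CPython implementation detail, but a telling one):

```python
a = 1000
b = a + 0       # same value, but the addition allocated a new object
print(a == b)   # True  -> equal values
print(a is b)   # False -> different heap objects (on CPython)
```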
Step 4: Refcounts. Python increments the reference count on the new object and decrements the counts on the values it no longer needs. More memory writes.
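The counters themselves are visible through `sys.getrefcount` (the reported number includes the temporary reference created by the call itself):

```python
import sys

x = object()
print(sys.getrefcount(x))  # typically 2: x, plus the call's argument
y = x                      # binding another name increments the count
print(sys.getrefcount(x))  # typically 3
```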
That's one iteration of your Python loop: dispatch, type check, two header lookups, heap allocation, refcount bookkeeping. For what C does in one instruction.
Now multiply by 100 million.
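One way to feel that per-iteration tax without writing any C: compare the explicit for-loop with the built-in `sum`, whose loop runs inside the interpreter's own C code. A rough illustration (timings vary by machine; `N` is scaled down from the video's 100 million):

```python
import timeit

N = 1_000_000  # scaled down so this runs quickly

def loop_sum():
    """Pure-Python loop: pays dispatch and boxing on every iteration."""
    s = 0
    for i in range(N):
        s += i
    return s

t_loop = timeit.timeit(loop_sum, number=5)
t_builtin = timeit.timeit(lambda: sum(range(N)), number=5)

print(f"for-loop: {t_loop:.3f}s   built-in sum: {t_builtin:.3f}s")
# The built-in is usually several times faster: same additions,
# far fewer interpreter round-trips.
```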
## The Assembly Showdown
Here's what C's `-O2` optimizer produces for the inner loop (AArch64):

```asm
loop:
    add  x19, x19, x8   ; s += i
    add  x8, x8, #1     ; i++
    cmp  x8, x9
    b.lt loop
```
Four instructions. Registers only. No memory allocation. No function calls.
And CPython's equivalent? It's in a file called `ceval.c`. The handler for a single `BINARY_OP` on two integers walks through: opcode fetch, branch to handler, pop two stack values, dispatch to the type's `nb_add` slot, type checks, unpack digits, call `long_add`, allocate a new `PyLongObject`, zero its memory, write header, write digits, return pointer, push onto stack, refcount bookkeeping, jump back.
Dozens of C function calls per Python iteration. Hundreds of instructions. For what C does in one.
The C compiler can also vectorize — on the right shape of loop it uses SIMD to add multiple numbers per instruction. Python's interpreter can't see the loop as a loop. It sees opcodes, and runs them one at a time.
## Is Python Stuck Here?
No.
```python
import numpy as np

s = np.arange(100_000_000).sum()
```
0.1 seconds. Faster than our C version.
Because NumPy isn't Python. NumPy is a thin Python wrapper around a C library. The array is a contiguous block of raw bytes: packed machine integers (int64 on most 64-bit builds), one after the other. And `.sum()` is compiled C code, often vectorized, hitting your CPU's add instructions directly.
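You can inspect that layout yourself (requires NumPy; note the default integer dtype is platform dependent, e.g. int64 on most 64-bit Linux/macOS builds, int32 on Windows):

```python
import numpy as np

a = np.arange(10)
print(a.dtype)                  # platform default, e.g. int64
print(a.itemsize)               # bytes per element: 8 for int64, not ~28
print(a.flags['C_CONTIGUOUS'])  # True: one packed block of raw bytes
print(a.nbytes)                 # itemsize * 10, no per-element headers
```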
Same answer. Same Python-looking API. But the loop runs in C.
That's the trick every fast Python library uses. NumPy, Pandas, PyTorch, scikit-learn — they aren't magic. They're C, wearing a Python mask.
Python has other escape hatches too: PyPy uses a tracing JIT, Cython compiles Python-like code to native, and Python 3.13 ships an experimental JIT if you build it with `--enable-experimental-jit`.
## The Real Lesson
When Python is slow, it's not because Python is broken. It's because a Python for-loop is doing something C doesn't do — pushing every number through an interpreter that treats every integer as a heap object.
Once you know that, you know when to reach for NumPy and when a plain for-loop is fine.
And there's a fascinating story inside NumPy — how it keeps that loop running at C speed without losing the Python feel. But that's for another video.