How Your Python Code Actually Runs
From x = 1 + 2 to Electrons, and Everything In Between
Reading time: ~14 minutes
You wrote x = 1 + 2. Python printed 3. You moved on.
Here's what you missed: that single line triggered a cascade of more than a dozen distinct mechanical processes across at least four layers of abstraction before a single electron moved. The result — 3 — landed in RAM after being processed by a lexer, a parser, a bytecode compiler, a virtual machine, an instruction decoder, a five-stage pipeline, a branch predictor, and three levels of cache. A translation lookaside buffer was consulted. Microcode was executed.
None of this is magic. It's engineering — old, clever engineering — and once you see it, you can't unsee it. Best of all, we built all of this without the help of a single LLM. Not one!
Why This Matters
If you've ever wondered why Python is slow, you've probably been told it's interpreted. That's not wrong, but it's the surface of a much weirder story.
If you've ever debugged a performance problem and found it in a hot loop you didn't write, you've encountered the cache hierarchy without knowing it — and understanding it is the difference between "I need a faster machine" and "I need to fix my data layout."
Always remember, we can't hardware our way out of a problem we softwared our way into.
Stage 1: The Text Is Not the Program
Your Python source file is bytes. Unicode bytes, specifically UTF-8. Before anything can run, Python has to read those bytes and figure out what they mean. That process starts with the lexer (sometimes called the tokenizer, not to be confused with an LLM tokenizer).
The lexer reads your source left to right and carves it into tokens — atomic units of meaning. For x = 1 + 2, the tokens are: NAME 'x', OP '=', NUMBER '1', OP '+', NUMBER '2', NEWLINE. That's it. Six tokens. No nesting, no hierarchy — just a flat stream.
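You can watch the lexer work from the stdlib. The tokenize module mirrors what CPython's internal tokenizer produces (it also appends an ENDMARKER for end-of-input):

```python
import io
import tokenize

# Feed the lexer our line and watch the token stream come out.
src = "x = 1 + 2\n"
tokens = list(tokenize.generate_tokens(io.StringIO(src).readline))
for tok in tokens:
    print(tokenize.tok_name[tok.type], repr(tok.string))
# NAME 'x'
# OP '='
# NUMBER '1'
# OP '+'
# NUMBER '2'
# NEWLINE '\n'
# ENDMARKER ''
```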
The parser takes that stream of tokens and builds a tree. It applies the grammar rules of Python to understand structure: = is an assignment, 1 + 2 is an addition expression, and the whole thing is a statement. What comes out is an Abstract Syntax Tree — an AST — where the nodes are operations and the leaves are values.
# You can see Python's AST yourself
import ast
print(ast.dump(ast.parse("x = 1 + 2"), indent=2))
# Output (simplified):
# Module(body=[
#   Assign(
#     targets=[Name(id='x')],
#     value=BinOp(left=1, op=Add(), right=2)
#   )
# ])
The AST is abstract because it captures the structure of the program, not the exact characters. It doesn't care whether you wrote 1+2 or 1 + 2. Same tree.
Stage 2: The Bytecode Compiler
Python doesn't execute the AST directly. It compiles the AST into bytecode — a sequence of simple instructions for an imaginary machine called the CPython Virtual Machine.
For x = 1 + 2, CPython compiles something like:
LOAD_CONST 1 # push 1 onto the stack
LOAD_CONST 2 # push 2 onto the stack
BINARY_OP + # pop two values, add them, push result
STORE_NAME x # pop the result, bind it to name 'x'
This bytecode is what gets stored in the .pyc files in your __pycache__ directory. It's also what the dis module shows you if you call dis.dis(some_function). These aren't CPU instructions — they're instructions for CPython's virtual CPU.
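Try it yourself. One wrinkle worth knowing: CPython's peephole optimizer constant-folds 1 + 2 at compile time, so the real disassembly is even shorter than the four-instruction sequence above, which shows what conceptually happens before folding:

```python
import dis

# CPython folds 1 + 2 into the constant 3 before the bytecode ever runs,
# so the disassembly shows a single LOAD_CONST instead of two loads and an add.
code = compile("x = 1 + 2", "<example>", "exec")
dis.dis(code)
print(3 in code.co_consts)  # True: the folded constant is baked into the code object
```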
The key insight: Python's "slow" reputation lives almost entirely here. Interpreting bytecode instruction-by-instruction in software runs roughly 10–50x slower than optimized native machine code for CPU-bound work. Each of those four instructions above involves a function call, reference counting checks, type dispatch, and dictionary lookups. We'll come back to why.
Stage 3: The VM Loop
CPython's bytecode evaluator is a giant switch statement inside a function called _PyEval_EvalFrameDefault. It's one of the most-read C functions in existence. The loop looks roughly like:
for (;;) {
opcode = *next_instr++;
switch (opcode) {
case LOAD_CONST: ...
case BINARY_OP: ...
case STORE_NAME: ...
// ~150 more cases
}
}
For every bytecode instruction, the VM fetches the opcode, dispatches to a handler, executes it, and loops. This is where Python spends most of its life — inside this loop, for every operation in your program, forever.
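The shape of that loop is easy to mirror in Python itself. A toy sketch, handling only the three opcodes our line needs (CPython's real loop has around 200 opcodes and far more state):

```python
# A toy stack-machine eval loop: fetch, dispatch, execute, repeat.
def run(code, names):
    stack = []
    for opcode, arg in code:                  # fetch the next instruction
        if opcode == "LOAD_CONST":
            stack.append(arg)                 # push a constant
        elif opcode == "BINARY_OP":           # '+' only, for brevity
            right, left = stack.pop(), stack.pop()
            stack.append(left + right)
        elif opcode == "STORE_NAME":
            names[arg] = stack.pop()          # bind the result to a name
        else:
            raise ValueError(f"unknown opcode: {opcode}")

names = {}
run([("LOAD_CONST", 1), ("LOAD_CONST", 2),
     ("BINARY_OP", "+"), ("STORE_NAME", "x")], names)
print(names)  # {'x': 3}
```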
This is also where the big bad scary GIL lives 👹. The Global Interpreter Lock. Everyone has an opinion about the GIL; let's look at what it actually is.
The GIL: What It Actually Locks
The GIL is a mutex. One mutex. It protects CPython's internal state from concurrent modification by multiple threads.
Why does it exist? CPython manages memory with reference counting. Every Python object has a counter tracking how many references point to it. When you write y = x, the object's refcount goes up by one. When x goes out of scope, it goes down. When it hits zero, the object is freed.
Reference counts are integers. Integers are not inherently thread-safe. If two threads simultaneously try to decrement the same refcount, you can get a race condition that results in a double-free or a dangling pointer. That corrupts the heap. Your process crashes in a way that produces no useful error message.
The GIL prevents this by ensuring only one thread runs Python bytecode at a time. In Python 3.2+, the GIL uses a time-based model: if another thread is waiting, the current holder is asked to release the GIL every 5ms (configurable via sys.setswitchinterval()) so the waiter gets a turn. This is safe for CPython's internals. It also means Python threads genuinely cannot run CPU-bound code in parallel on a multi-core machine.
The GIL protects CPython's internal state — not your application's data. Two threads can still race on your own dict if you're not careful. The GIL is not a substitute for application-level locking.
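A minimal sketch of what application-level locking means here. The Lock, not the GIL, is what makes the final count reliable, because a read-modify-write like counter += 1 is several bytecodes and a thread switch can land between them:

```python
import threading

counter = 0
lock = threading.Lock()

def bump(n):
    global counter
    for _ in range(n):
        with lock:            # without this, increments can be lost to races
            counter += 1

threads = [threading.Thread(target=bump, args=(100_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)  # 400000 — guaranteed by the Lock, not by the GIL
```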
That's why CPU-bound Python code doesn't benefit from threading. That's why multiprocessing uses separate processes (each with its own GIL). Python 3.12 added per-interpreter GILs (PEP 684) — subinterpreters within a single process, each with their own lock. Python 3.13 went further with experimental "free-threaded" mode (PEP 703) — no GIL at all, true multi-threaded parallelism. As of 3.14, free-threaded mode is advancing toward official support, and a new concurrent.interpreters module gives you per-interpreter GIL parallelism from the stdlib. The GIL story is moving fast.
Stage 4: The JIT Kicks In
Modern runtimes don't stop at bytecode interpretation. Just-In-Time compilation is what bridges the gap between "interpreted" and "native."
The idea: instead of interpreting bytecode over and over, watch which code paths get executed frequently — the hot paths — and compile those to actual machine code. Machine code runs directly on the CPU, no interpreter loop overhead.
The core idea is the same everywhere — V8 (JavaScript), HotSpot (Java), PyPy (Python) all do this. Every function has a counter. Each time it executes, the counter increments. Once it crosses a threshold — say, 50 executions — the runtime says "this is hot" and ships it to the JIT compiler. The JIT looks at the bytecode, sees what types have actually been passed, and generates optimized machine code for those specific types.
bytecode: BINARY_OP +
JIT says: "these two things have always been ints."
JIT emits: ADD reg1, reg2 ← one CPU instruction
The JIT can also inline function calls (paste the function body directly into the caller, eliminating call overhead), unbox objects (extract the raw integer from a wrapper), and perform dead code elimination (remove branches that never get taken). This is why microbenchmarks have to be written carefully — the first 50 calls are slow (interpreter), then the JIT fires and the next 50,000 are fast (native code). Measure the wrong phase and your numbers are meaningless.
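The counting-and-specializing idea can be sketched in a few lines of Python. Purely illustrative: the threshold, wrapper, and "specialized" closure are all stand-ins, since a real JIT emits machine code rather than Python closures:

```python
HOT_THRESHOLD = 50  # illustrative; real thresholds vary per runtime

def make_tiered(generic, specialize):
    calls = 0
    fast = None
    def call(a, b):
        nonlocal calls, fast
        if fast is not None:
            return fast(a, b)            # tier 2: specialized path
        calls += 1
        result = generic(a, b)           # tier 1: generic interpretation
        if calls >= HOT_THRESHOLD:
            fast = specialize(a, b)      # "compile" for the observed types
        return result
    return call

def generic_add(a, b):
    return a + b                         # stand-in for slow dynamic dispatch

def specialize(a, b):
    if type(a) is int and type(b) is int:
        return lambda x, y: x + y        # stand-in for an emitted ADD reg, reg
    return None

add = make_tiered(generic_add, specialize)
results = [add(i, i) for i in range(100)]  # first 50 generic, rest "compiled"
```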
CPython was late to this party. PyPy has had a mature tracing JIT since 2009. V8 has been doing it since 2008. CPython's first step was the specializing adaptive interpreter in 3.11 (PEP 659) — runtime optimization without emitting machine code. Then 3.13 added an experimental true JIT (PEP 744). It's still opt-in as of 3.14 (enable with PYTHON_JIT=1), but the direction is clear: CPython is finally compiling to metal.
Stage 5: Machine Code and the CPU
Now we leave the runtime and enter the hardware. A JIT has emitted machine code, or a C compiler compiled your C extension, or Python called into numpy which is all native code. Either way: we have actual CPU instructions now.
Let's take a simple addition: add two integers, store the result.
A CPU instruction is an encoding. On x86-64, ADD rax, rbx encodes as three bytes: 0x48 0x01 0xD8 (REX.W prefix, ADD opcode, ModRM byte). Those bytes are fetched from memory by the instruction fetch unit, decoded by the instruction decoder into an internal micro-operation, and dispatched through a five-stage pipeline:
Stage 1 Stage 2 Stage 3 Stage 4 Stage 5
┌───────────┐ ┌──────────┐ ┌──────────┐ ┌──────────┐ ┌───────────┐
│ Fetch │→ │ Decode │→ │ Execute │→ │ Memory │→ │ Writeback │
│ get bytes │ │ parse op │ │ run ALU │ │ load/str │ │ write reg │
└───────────┘ └──────────┘ └──────────┘ └──────────┘ └───────────┘
Each stage handles one instruction per clock cycle. While stage 3 is executing instruction N, stage 2 is decoding instruction N+1, and stage 1 is fetching instruction N+2. That's pipelining — you keep the hardware fed. (In practice, the Memory stage can take 4–100+ cycles for a cache miss — the pipeline stalls or the CPU finds other work to fill the gap.)
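That overlap is easy to tabulate. A sketch of the ideal schedule, with no stalls or hazards modeled:

```python
# Which instruction occupies each pipeline stage during a given cycle,
# assuming an ideal 5-stage pipeline that retires one instruction per cycle.
STAGES = ["Fetch", "Decode", "Execute", "Memory", "Writeback"]

def stage_occupancy(cycle, n_instructions):
    """Map stage name -> instruction index active in that stage this cycle."""
    return {stage: cycle - depth
            for depth, stage in enumerate(STAGES)
            if 0 <= cycle - depth < n_instructions}

# Cycle 2: instruction 0 is executing, 1 is decoding, 2 is being fetched.
print(stage_occupancy(2, 3))  # {'Fetch': 2, 'Decode': 1, 'Execute': 0}
```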
Branch Prediction: The CPU's Gamble
Here's where it gets weird.
The pipeline works great as long as instructions flow in order. But what about if statements? The CPU doesn't know which branch to take until it evaluates the condition — and the condition is in stage 3 while the next instruction is being fetched in stage 1. That's two clock cycles of uncertainty.
Modern CPUs solve this with branch prediction: the CPU guesses which branch will be taken, starts executing it speculatively, and if the guess was right, great. If wrong, it has to flush the pipeline — discard all the work in progress and restart from the correct branch. A misprediction costs 15–20 clock cycles.
The branch predictor is sophisticated. It learns patterns. An if in a loop that always takes the same path? Predictor learns it in 3–4 iterations and gets it right every time after that. A random branch? 50% miss rate. This is why branches on random data are expensive.
That's why x = 1 + 2 — completely predictable, no branches — runs at near-wire speed. And it's why sorting data before processing it can speed up the loop even when the sort itself costs time. (The famous Stack Overflow branch-prediction answer is worth reading if you haven't.)
Speculative Execution
The CPU doesn't stop at branch prediction. Modern out-of-order processors execute instructions before they know if those instructions will be needed.
Speculative execution means the CPU sees an if statement, starts executing both branches simultaneously (or at least the most likely one), and discards the results of the wrong one when the condition resolves. This uses a reorder buffer — a queue of in-flight instructions that haven't committed yet.
This is where Spectre and Meltdown came from. Disclosed in January 2018 by Jann Horn (Google Project Zero) and independent researchers, they were two distinct attacks that exploited speculative execution in different ways.
Meltdown (CVE-2017-5754): when the CPU speculatively executes a load from a kernel memory address — one your process isn't allowed to read — it raises a fault and discards the result. But before it discards the result, it has already pulled that data into the cache. An attacker could craft a timing probe: speculatively read kernel memory, use the forbidden byte to index into an array, then measure which array element is now cached. Cache hits are fast. Cache misses are slow. The timing difference reveals the byte value. The speculative execution was rolled back; the cache state was not.
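The probe logic can be simulated in a few lines. This is a toy model only: no real speculation, faults, or timing, just the reason cache state that survives rollback leaks a byte:

```python
LINE_SIZE = 64  # one probe-array cache line per possible byte value

def victim(secret_byte, cache):
    # The speculative load's architectural result gets discarded, but the
    # probe-array line it touched (indexed by the secret) stays warm.
    addr = secret_byte * LINE_SIZE
    cache.add(addr // LINE_SIZE)

def attacker(cache):
    # "Time" all 256 probe lines; the single warm one reveals the byte.
    warm = [line for line in range(256) if line in cache]
    return warm[0] if len(warm) == 1 else None

cache = set()                  # starts flushed
victim(0x42, cache)
print(hex(attacker(cache)))    # 0x42 recovered purely from cache state
```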
Spectre (CVE-2017-5753, CVE-2017-5715): subtler and harder to fix. Instead of crossing the kernel/user boundary, Spectre tricks a victim process's own speculative execution into leaking its data to the attacker. An attacker can train the branch predictor to mispredict a bounds check, causing the victim to speculatively access out-of-bounds memory, contaminating the cache with data the attacker can then probe via timing.
The fix for Meltdown was KPTI — kernel page-table isolation — which unmaps kernel pages from user-space page tables entirely. The CPU can't even speculatively load what isn't mapped. The cost: every syscall now requires a CR3 register swap to switch page tables, which flushes the TLB on CPUs without PCID support. Some workloads saw 5–30% performance regressions after the patches.
I'm glossing over the reorder buffer, register renaming, and retirement logic that make speculative execution work. The point is: the CPU is running ahead of your program, guessing what's coming next, and these guesses left observable traces in shared hardware state.
It was 2018. Everyone had a bad year.
Stage 6: The Memory Hierarchy
Getting 3 into RAM isn't as simple as writing a number. Memory is a hierarchy, and the level you're reading from or writing to determines how long you wait.
Your CPU has registers — tiny, blazing-fast storage that's literally wired into the arithmetic units. Adding two numbers in registers takes one clock cycle. There are maybe 16 general-purpose registers on x86-64. Everything else has to come from further away.
L1 cache is on-die, per-core, about 32KB for each of its two halves, latency ~4 cycles. It's split in two: L1i (instruction cache) holds the machine code bytes the CPU is about to execute, L1d (data cache) holds the data those instructions operate on. They're separate so the CPU can fetch the next instruction and load data for the current one simultaneously — no contention. L2 is larger, unified (instructions and data share it), slightly further away, ~12 cycles. L3 is shared between all cores, several megabytes, ~40 cycles. Main RAM is ~100 cycles. NVMe is 100,000+ cycles.
The CPU doesn't load individual bytes from RAM. It loads cache lines — 64 bytes at a time, aligned to 64-byte boundaries. If you read byte 0, you get bytes 0–63 in cache for free. If you then read byte 8, it's already there: 1 cycle. If you read byte 128 instead, you get a cache miss: 100 cycles for a new line.
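The arithmetic is just integer division by the line size:

```python
LINE_SIZE = 64  # bytes per cache line on typical x86-64 parts

def cache_line(addr):
    return addr // LINE_SIZE   # which line this byte address falls in

print(cache_line(0), cache_line(8), cache_line(63), cache_line(128))
# 0 0 0 2 -> bytes 0, 8, and 63 share one line; byte 128 needs a new fetch
```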
This is why array iteration is fast and linked-list iteration is slow, even when both are O(n). Arrays are contiguous in memory — every element is either in the current cache line or the next one. Linked list nodes can be anywhere in the heap, scattered across thousands of cache lines. Each pointer dereference is a potential cache miss. In practice, iterating a linked list runs 5–10x slower than iterating a vector of the same size, and the gap widens as the list grows.
That's why "just use a faster computer" doesn't always help. If your hot loop is blowing out the L1 cache, a 4 GHz CPU vs a 3 GHz CPU buys you 33%. Fixing your data layout to be cache-friendly can buy you 10x.
For x = 1 + 2: the result lives in a register during computation. When CPython stores it, it writes into the Python object heap, which is almost certainly in L1 or L2 by the time you're done — this code is hot, and hot code keeps its data in cache.
The TLB: Your Address Book for Addresses
One more layer. The memory addresses in your program — the 0x7fff... values you see in stack traces — are virtual addresses. They don't correspond directly to physical RAM locations. Every process has its own virtual address space, and the operating system maps it to physical memory via page tables.
Every memory access requires translating a virtual address to a physical one. Page tables are stored in memory. If you did a full page table walk on every single memory access, each real access would drag four extra memory reads behind it, and address translation would dominate your memory traffic.
The hardware solution is the TLB — Translation Lookaside Buffer. It's a small, very fast cache of recent virtual→physical address mappings. A TLB hit costs ~1 cycle. A TLB miss requires a page table walk: 4 memory accesses on x86-64, potentially 400 cycles.
When a context switch happens — when the OS switches from your process to another — the TLB gets flushed (or tagged per-process, depending on the hardware). This is one of the hidden costs of context switches: the first few hundred memory accesses after a context switch are slow because the TLB is cold.
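A toy model of the TLB as an LRU cache makes the hit/walk/flush behavior concrete. Capacity, cycle costs, and flush policy here are illustrative, not any particular CPU's:

```python
from collections import OrderedDict

PAGE_SIZE = 4096

class TLB:
    def __init__(self, capacity=64):
        self.entries = OrderedDict()      # virtual page number -> frame number
        self.capacity = capacity
        self.hits = self.walks = 0

    def translate(self, vaddr, page_table):
        vpn, offset = divmod(vaddr, PAGE_SIZE)
        if vpn in self.entries:
            self.entries.move_to_end(vpn)        # TLB hit: ~1 cycle
            self.hits += 1
        else:
            self.walks += 1                      # miss: walk the page table
            self.entries[vpn] = page_table[vpn]  # (~4 memory accesses)
            if len(self.entries) > self.capacity:
                self.entries.popitem(last=False) # evict least-recently used
        return self.entries[vpn] * PAGE_SIZE + offset

    def flush(self):
        self.entries.clear()              # what a context switch costs you

tlb = TLB()
page_table = {0: 9, 1: 5}                 # vpn -> physical frame number
tlb.translate(100, page_table)            # cold: page table walk
tlb.translate(200, page_table)            # same page: TLB hit
print(tlb.hits, tlb.walks)  # 1 1
```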
For x = 1 + 2 in a running program: the pages involved are hot, the TLB entries are present, no page table walk needed.
Putting It Together
Let's run the full trace. You hit Enter.
1. Python reads the .py file off disk (or from its own input buffer if you're in the REPL)
2. The lexer carves x = 1 + 2 into six tokens
3. The parser builds an AST: an Assign node with a BinOp on the right
4. The bytecode compiler emits four instructions: LOAD_CONST 1, LOAD_CONST 2, BINARY_OP +, STORE_NAME x
5. The eval loop fetches the first bytecode: LOAD_CONST 1
6. The eval loop checks the type of 1 (it's a PyLongObject), holds the GIL as always during bytecode execution, and pushes the object reference onto the value stack
7. Same for 2
8. BINARY_OP + pops both, checks their types via their tp_as_number->nb_add slot, calls the C function for integer addition
9. CPython creates a new PyLongObject with value 3, increments its reference count to 1, pushes it onto the stack
10. STORE_NAME x pops the value, looks up the current locals dict, and stores the reference there — the name 'x' now maps to the object
11. The underlying C addition instruction fetches both integer values from the PyLongObject structs (L1 cache hit — these are hot objects)
12. The x86 ADD instruction runs through fetch → decode → execute → writeback in the CPU pipeline
13. The branch predictor correctly predicted there was no overflow (there wasn't), so no exception path is taken
14. The 64-byte cache line containing the PyLongObject for 3 gets written to L1 cache
15. The TLB resolves the virtual address of the locals dict to its physical RAM location in ~1 cycle
16. Reference counting updates the old value of x (if any) — decrementing its refcount, potentially freeing it
17. The GIL is held throughout; no other Python thread ran during this assignment
18. Python's memory allocator (obmalloc) manages the PyLongObject allocation from its pool — no syscall needed for small objects
19. The assignment completes; the eval loop advances the instruction pointer to the next bytecode
20. The number 3 is now in RAM, bound to the name x in the frame's locals dictionary
Twenty steps. One line. 3 in RAM.
That's why Python feels slow compared to C — you're doing twenty of those steps for work that a C compiler would reduce to a single mov instruction.
That's why JITs work: once the runtime knows that x is always going to be an int and + is always integer addition, it can collapse steps 6 through 16 into two machine instructions and a register write.
Further Reading
- Python's dis module — Disassemble any Python function into its bytecode. Run dis.dis(lambda: 1 + 2) right now.
- CPython's ceval.c — The eval loop itself. Terrifying and magnificent.
- "What Every Programmer Should Know About Memory" — Ulrich Drepper — 100 pages on the cache hierarchy. Dense. Worth it.
- Agner Fog's optimization manuals — Everything about CPU microarchitecture from a person who spent decades measuring it.
- Branch prediction on Stack Overflow — The most-upvoted answer in the site's history, and a perfect demonstration of cache and branch effects.
Naz Quadri occasionally re-reads CPython's ceval.c at 2am when he can't sleep. He blogs at nazquadri.dev. Rabbit holes all the way down 🐇🕳️.