Python's Memory Management Architecture
Computer hardware resources are managed by the operating system, and memory is no exception. Applications obtain memory from the OS via system calls, and the C library wraps those system calls into a general-purpose memory allocator, exposing the familiar malloc family of functions.
This general-purpose allocator implemented by the C library is an important dividing line — it sits at Layer 0 of the memory management hierarchy. Everything above it is application-level memory management; everything below it is the OS internals hidden beneath the surface.
┌─────────────────────────────────────────────┐
│ Layer 3: Object-specific allocators         │ e.g. float free-list cache
├─────────────────────────────────────────────┤
│ Layer 2: Unified object allocator           │ object.tp_alloc / PyObject_New
├─────────────────────────────────────────────┤ ← Python's own memory mgmt
│ Layer 1: Python Memory Pool (pymalloc)      │ ★ The focus of this article
│          (PyMem_XXXX interface)             │
│          handles allocations ≤ 512 bytes    │
├─────────────────────────────────────────────┤
│ Layer 0: C Library (glibc / malloc)         │ General-purpose allocator
├─────────────────────────────────────────────┤
│ Layer -1: OS Virtual Memory Manager         │ Page-table based, MMU involved
├─────────────────────────────────────────────┤
│ Layer -2: Physical Memory + Disk (swap)     │ Hardware level
└─────────────────────────────────────────────┘
Layers 1–3 make up Python's own memory management:
- Layer 1: A memory allocator that intercepts all memory requests. Its core is the memory pool, the star of this article.
- Layer 2: Built on top of Layer 1's unified PyMem_XXXX interface, it implements unified object memory allocation (object.tp_alloc).
- Layer 3: Serves specific object types, such as the float free-object cache pool introduced in earlier chapters.
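To make the layering concrete, here's a minimal C sketch, assuming a CPython development environment where Python.h and the interpreter library are available. PyMem_Malloc exercises the Layer 1 interface; creating a float goes through the Layer 2 object allocator and Layer 3's float free list:

#include <Python.h>

int main(void)
{
    Py_Initialize();

    /* Layer 1: the PyMem_XXXX interface. A small request like this one
       is served by the pymalloc memory pool. */
    void *buf = PyMem_Malloc(64);
    PyMem_Free(buf);

    /* Layers 2 and 3: creating an object goes through the unified object
       allocator, and floats additionally have their own free-list cache. */
    PyObject *f = PyFloat_FromDouble(3.14);
    Py_DECREF(f);

    return Py_FinalizeEx();
}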
Why Doesn't Python Just Use malloc Directly?
There are several reasons CPython builds its own layer on top of malloc:
- Introducing a memory pool absorbs the pressure of frequent object creation and destruction.
- It minimizes memory fragmentation and improves memory utilization.
- There are many malloc implementations, and their performance varies wildly across platforms.
The Challenge of Memory Fragmentation
Memory fragmentation is a major headache for classic memory allocators, and its consequences can be severe. Here's a typical example:
Total free memory: 1900K
┌──────┬──────┬──────┬──────┬──────┬──────┬──────┐
│ used │ 100K │ used │ 200K │ used │1600K │ used │
│ │ free │ │ free │ │ free │ │
└──────┴──────┴──────┴──────┴──────┴──────┴──────┘
Request: 1000K contiguous → ❌ FAIL
(1900K free in total, but no single contiguous block ≥ 1000K)
Even though there's still 1900K of free memory, it's scattered across a series of non-contiguous fragments — making it impossible to allocate even 1000K.
Root Cause Analysis
To solve the problem, we must first understand what causes it.
Applications request memory blocks of unpredictable sizes — some large, some small. The timing of frees is equally unpredictable. Classic allocators manage all sizes together and allocate in arrival order:
Step 1: Initial state — memory is contiguous
[ A | B | C | D | E | free ]
Step 2: Large blocks can be split and re-allocated
[ A | b1| b2| C | D | E | free ]
↑ B was split into b1, b2
Step 3: Earlier allocations are not necessarily freed first — holes appear
[ A |hole| b2| C |hole| E | free ]
↑ b1 freed ↑ D freed
Step 4: Over time, fragmentation worsens
[hole|hole| b2|hole|hole| E |hole ]
The root cause is clear: mixing different-sized allocations in the same memory region, then splitting large blocks for re-allocation, creates gaps that are too small and scattered to reuse.
Solution: Size-Segregated Memory Management
Once we've identified the root cause, the solution becomes obvious — divide memory into separate regions by block size, and manage each independently.
A simple example:
┌─────────────────┬─────────────────┬─────────────────┐
│ Small Region │ Medium Region │ Large Region │
│ (e.g. ≤64B) │ (65B – 256B) │ (257B – 512B) │
├────────┬────────┼────────┬────────┼────────┬────────┤
│ page 1 │ page 2 │ page 1 │ page 2 │ page 1 │ page 2 │
│[8B blk]│[8B blk]│[128B ] │[128B ] │[512B ] │[512B ] │
└────────┴────────┴────────┴────────┴────────┴────────┘
As shown, memory is divided into small, medium, and large regions. Each region consists of several memory pages, and each page is divided into uniform-sized blocks.
This way, small block allocations cannot fragment the large-block region.
Fragmentation within each region still exists, but those fragments are always reusable by the next same-size request. Furthermore, with an optimized allocation strategy, fragmentation can be further reduced. For example, new allocations are made from page 1 first — page 2 gradually empties out and can eventually be reclaimed entirely.
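To see concretely why this works, here's a toy sketch in C (illustrative only, not CPython code): each region serves a single block size, so any hole left behind by a free is exactly the size the next same-class request needs.

#include <stdio.h>
#include <stdlib.h>

#define PAGE_SIZE 4096

/* One region serves a single block size: a bump pointer hands out fresh
   blocks, and a free list (threaded through the freed blocks themselves)
   recycles holes. */
typedef struct {
    unsigned char *page;   /* backing memory for this region */
    size_t block_size;     /* every block here is exactly this size */
    size_t next;           /* offset of the next never-used block */
    void *free_list;       /* holes left by frees, all block_size bytes */
} region_t;

static void *region_alloc(region_t *r)
{
    if (r->free_list != NULL) {          /* reuse a hole: it always fits */
        void *p = r->free_list;
        r->free_list = *(void **)p;
        return p;
    }
    if (r->next + r->block_size > PAGE_SIZE)
        return NULL;                     /* region exhausted */
    void *p = r->page + r->next;
    r->next += r->block_size;
    return p;
}

static void region_free(region_t *r, void *p)
{
    *(void **)p = r->free_list;          /* push the hole onto the list */
    r->free_list = p;
}

int main(void)
{
    region_t small = { malloc(PAGE_SIZE), 64, 0, NULL };
    void *a = region_alloc(&small);
    void *b = region_alloc(&small);
    region_free(&small, a);              /* leaves a 64-byte hole... */
    void *c = region_alloc(&small);      /* ...reused by the next request */
    printf("a=%p b=%p c=%p, hole reused: %s\n", a, b, c, a == c ? "yes" : "no");
    free(small.page);
    return 0;
}

pymalloc applies the same two ideas, a bump pointer plus an in-block free list, inside each of its 4KB pools, as we'll see below.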
Inside the Python VM, objects are constantly being created and destroyed, triggering frequent memory requests and releases. These allocations are generally small but happen at very high frequency. Python specifically designed the memory pool to optimize this scenario.
The 512-Byte Threshold
Request size → Allocator used
─────────────────────────────────────────
0 bytes → malloc() directly
1 – 512 bytes → pymalloc memory pool ✅
> 512 bytes → malloc() directly
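In code, the routing looks roughly like this sketch, modeled on _PyObject_Malloc in Objects/obmalloc.c (where SMALL_REQUEST_THRESHOLD really is 512); pool_alloc here is a hypothetical stand-in for the pool allocator described in the rest of this article:

#include <stdlib.h>

#define SMALL_REQUEST_THRESHOLD 512   /* same constant as obmalloc.c */

/* Hypothetical stand-in for the pool allocator covered below; returning
   NULL means "not handled here", so the request falls through. */
static void *pool_alloc(size_t nbytes) { (void)nbytes; return NULL; }

/* Sketch of the routing: small, non-zero requests are offered to the
   memory pool first; everything else goes to the C allocator. */
static void *object_malloc(size_t nbytes)
{
    if (nbytes != 0 && nbytes <= SMALL_REQUEST_THRESHOLD) {
        void *p = pool_alloc(nbytes);
        if (p != NULL)
            return p;                 /* served from the memory pool */
    }
    return malloc(nbytes);            /* 0 bytes or > 512 bytes */
}

Note the fall-through: even a small request ends up with malloc if the pool can't serve it, for example when no pool memory is available.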
Why Not One Pool Per Byte Size?
Python does not create a separate pool for every possible size (1–512). Reasons:
- 512 pool types would add enormous complexity.
- More pool types = more overhead.
- If a specific size is only requested once, the rest of that page's blocks are wasted.
Instead, Python uses 8-byte granularity, rounding up to the nearest multiple of 8. This gives 64 size classes:
| Request Size | Block Size Allocated | Class Index |
|---|---|---|
| 1 – 8 | 8 | 0 |
| 9 – 16 | 16 | 1 |
| 17 – 24 | 24 | 2 |
| 25 – 32 | 32 | 3 |
| … | … | … |
| 497–504 | 504 | 62 |
| 505–512 | 512 | 63 |
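The mapping is cheap to compute: subtract one and shift. Here's a small self-contained C program mirroring the computation obmalloc.c performs (under the 8-byte scheme used in this article):

#include <stdio.h>

int main(void)
{
    /* Size class index = (nbytes - 1) >> 3; block size = (index + 1) << 3.
       This reproduces the table above. */
    unsigned int samples[] = { 1, 8, 9, 24, 25, 497, 512 };
    for (size_t i = 0; i < sizeof(samples) / sizeof(samples[0]); i++) {
        unsigned int nbytes = samples[i];
        unsigned int index = (nbytes - 1) >> 3;    /* class index */
        unsigned int size = (index + 1) << 3;      /* block size handed out */
        printf("request %3u -> class %2u, block %3u\n", nbytes, index, size);
    }
    return 0;
}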
Memory Utilization
This approach sacrifices some memory utilization. For the 8-byte class, the average request in the 1–8 range is 4.5 bytes served by an 8-byte block, so average utilization is (1 + 8) / 2 / 8 = 56.25%. However, averaged across all 64 size classes, utilization is actually quite high:
total_requested = 0
total_allocated = 0
for i in range(1, 513):
    total_requested += i                  # every request size, equally weighted
    total_allocated += (i + 7) // 8 * 8   # rounded up to a multiple of 8
print('{:.2f}%'.format(total_requested / total_allocated * 100))
# 98.65%
Overall memory utilization reaches 98.65% — excellent.
Memory Pool Implementation
pool — The Core Data Structure
Now let's look at the source code in Objects/obmalloc.c.
Note: On 64-bit systems, CPython actually uses 16-byte alignment (not 8-byte). For clarity, we'll use 8-byte in our diagrams (these macros are configurable anyway).
#if SIZEOF_VOID_P > 4
#define ALIGNMENT 16 /* must be 2^N */
#define ALIGNMENT_SHIFT 4
#else
#define ALIGNMENT 8 /* must be 2^N */
#define ALIGNMENT_SHIFT 3
#endif
A helper macro converts a class index to block size (e.g., class 1 → 16 bytes):
#define INDEX2SIZE(I) (((uint)(I) + 1) << ALIGNMENT_SHIFT)
Each memory page (pool) is 4KB:
#define SYSTEM_PAGE_SIZE (4 * 1024)
#define POOL_SIZE SYSTEM_PAGE_SIZE // 4096 bytes
Python treats each memory page as a pool — a container of uniform-sized blocks. At the start of each page sits a pool_header struct:
struct pool_header {
    union { block *_padding;
            uint count; } ref;          /* number of allocated blocks */
    block *freeblock;                   /* pool's free list head */
    struct pool_header *nextpool;       /* next pool of this size class */
    struct pool_header *prevpool;       /* previous pool of this class */
    uint arenaindex;                    /* index into arenas */
    uint szidx;                         /* block size class index */
    uint nextoffset;                    /* offset to next virgin block */
    uint maxnextoffset;                 /* largest valid nextoffset */
};
Key fields:
- ref.count: number of allocated blocks
- freeblock: pointer to the head of the free-block linked list
- nextpool / prevpool: doubly-linked list pointers to other pools of the same size class
- szidx: size class index
- nextoffset: offset of the next uninitialized ("virgin") block
- maxnextoffset: largest valid value of nextoffset
Pool Initialization and Block Allocation
When a new 4KB page is initialized, the first block is immediately allocated (since initialization is always triggered by an allocation request):
┌──────────────┬───────────────────────────────────────┐
│ pool_header  │ [blk0:alloc'd] [blk1] [blk2] [blk3]…  │
│ (48 bytes)   │  ↑ already given out                  │
└──────────────┴───────────────────────────────────────┘
                               ↑ nextoffset points at blk1
As more allocations come in, blocks are taken from the uninitialized (grey) area:
┌──────────────┬──────────────────────────────────────┐
│ pool_header  │ [alloc][alloc][alloc][blk3][blk4]…   │
└──────────────┴──────────────────────────────────────┘
                                     ↑ nextoffset
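Putting the fields together, the allocation fast path looks roughly like this, a simplified sketch of pymalloc_alloc that reuses the struct and the INDEX2SIZE macro above, plus obmalloc.c's block/poolp typedefs:

typedef uint8_t block;                     /* as in obmalloc.c */
typedef struct pool_header *poolp;

/* Hand out one block from a used pool (simplified from pymalloc_alloc). */
static block *
allocate_from_pool(poolp pool)
{
    block *bp = pool->freeblock;           /* a used pool always has one */
    pool->ref.count++;
    pool->freeblock = *(block **)bp;       /* pop: next free block, or NULL */
    if (pool->freeblock == NULL && pool->nextoffset <= pool->maxnextoffset) {
        /* Free list exhausted: carve the next virgin block off the page. */
        pool->freeblock = (block *)pool + pool->nextoffset;
        pool->nextoffset += INDEX2SIZE(pool->szidx);
        *(block **)pool->freeblock = NULL; /* new one-element free list */
    }
    /* If freeblock is still NULL, the pool is now full and will be
       unlinked from its used-pool list (see below). */
    return bp;
}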
When a block is freed (e.g., block 0), Python inserts it at the head of the free list. Each free block stores a next pointer in its first bytes, pointing to the next free block:
freeblock
│
▼
┌───────┐ ┌───────┐
│ blk0 │────▶│ blk2 │────▶ NULL
│(freed)│ │(freed)│
└───────┘ └───────┘
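Freeing is the mirror image; a simplified sketch of the corresponding part of pymalloc_free:

/* Return block p to its pool: push it onto the head of the free list. */
static void
free_to_pool(poolp pool, void *p)
{
    block *lastfree = pool->freeblock;
    *(block **)p = lastfree;               /* freed block points at old head */
    pool->freeblock = (block *)p;          /* and becomes the new head */
    pool->ref.count--;
    /* lastfree == NULL means the pool was full and must be re-linked into
       its used-pool list; ref.count == 0 means the pool is now empty. */
}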
Pool States
A pool can be in one of three states:
┌─────────────────────────────────────────────────────────┐
│ empty │ All blocks free. count=0. Can be reclaimed │
│ │ or reused for a different size class. │
├─────────────────────────────────────────────────────────┤
│ used │ Some blocks allocated, some free. │
│ │ Actively serves allocation requests. │
├─────────────────────────────────────────────────────────┤
│ full │ All blocks allocated. freeblock=NULL. │
│ │ Python ignores it until a block is freed. │
└─────────────────────────────────────────────────────────┘
Why do states matter? They determine Python's handling strategy:
- empty: The page can be returned to the OS or cached for reuse.
- full: Python sets it aside and ignores it.
- used: Python organizes these into a doubly-linked circular list for active allocation.
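In terms of the pool_header fields, the state can be read off directly. CPython keeps no explicit state field; this sketch just shows how the state is implied:

typedef enum { POOL_EMPTY, POOL_USED, POOL_FULL } pool_state;

/* Illustrative only: CPython infers the state from these fields and from
   which list the pool is currently linked into. */
static pool_state
classify_pool(poolp pool)
{
    if (pool->ref.count == 0)
        return POOL_EMPTY;                 /* every block is free */
    if (pool->freeblock == NULL)
        return POOL_FULL;                  /* nothing left to hand out */
    return POOL_USED;                      /* partly allocated, partly free */
}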
The usedpools Array
Used Pool Linked List
used pools of the same size class are linked together via nextpool / prevpool into a doubly-linked circular list:
dummy node (virtual head)
  │
  ▼
┌───────────┐    ┌───────────┐    ┌───────────┐
│  pool_A   │◀──▶│  pool_B   │◀──▶│  pool_C   │
│  16B cls  │    │  16B cls  │    │  16B cls  │
│ (full-ish)│    │  (half)   │    │ (almost   │
│           │    │           │    │  empty)   │
└───────────┘    └───────────┘    └───────────┘
An empty list has pool->nextpool == pool (points to itself).
Python always allocates from the first pool in the list. When a pool becomes full, it's removed from the list. When a freed block causes a full pool to become used again, it's re-inserted at the head — ensuring the fullest pools stay at the front, and the emptiest pools drift to the tail and eventually become empty.
After free: full → used, re-insert at head
┌──────────┐ ┌─────────┐ ┌──────────────┐
│ pool_new │◀──▶│ pool_A │◀──▶│ pool_B(empty)│
│(re-added)│ │ │ │ → reclaimed │
└──────────┘ └─────────┘ └──────────────┘
Pool List Array (usedpools)
Python maintains one used-pool list per size class (64 total). Naively, this would require 64 full pool_header dummy nodes:
64 × 48 bytes = 3072 bytes = 3KB
But Python's authors are more frugal. Since dummy nodes only use nextpool and prevpool, they pack them into a flat pointer array called usedpools — each dummy node becomes just 2 pointers:
usedpools array (128 pointers = 1KB):
index: [0] [1] [2] [3] [4] [5] [6] [7] ...
├─────────┤ ├─────────┤ ├─────────┤
│ class 0 │ │ class 1 │ │ class 2 │ ...
│ (8B) │ │ (16B) │ │ (24B) │
└─────────┘ └─────────┘ └─────────┘
The trick: offset a pointer into usedpools so it can be treated as a pool_header*, arranged so that the fake header's nextpool and prevpool fields land exactly on two array slots. Since only nextpool and prevpool are ever accessed on dummy nodes, the other (out-of-bounds) fields are never actually touched.
Result: 64 × 16 bytes = 1024 bytes = 1KB — saving two-thirds of the memory compared to the naive approach.
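The relevant lines in Objects/obmalloc.c (slightly abridged here) show how this is done. PTA(x) points two pointer-widths in front of usedpools[2*x], so that the fake header's nextpool and prevpool land on usedpools[2*x] and usedpools[2*x + 1]; PT(x) initializes both slots to point back at the dummy itself, which is exactly the "empty list points to itself" condition mentioned earlier:

/* nextpool sits 2 pointer-widths into pool_header (after ref and
   freeblock), hence the backwards offset of 2*sizeof(block *). */
#define PTA(x)  ((poolp)((uint8_t *)&(usedpools[2*(x)]) - 2*sizeof(block *)))
#define PT(x)   PTA(x), PTA(x)

static poolp usedpools[2 * NB_SMALL_SIZE_CLASSES] = {   /* 64 classes here */
    PT(0), PT(1), PT(2), PT(3), PT(4), PT(5), PT(6), PT(7),
    /* ... continues through PT(63) ... */
};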