TL;DR: If your garbage collector shows nothing wrong but memory keeps growing, you might not have a leak at all: you might have memory fragmentation. It happens because glibc creates separate memory arenas per thread, and a handful of live objects can prevent entire arenas from being returned to the OS. Switching to jemalloc fixed it for us.
Scenario
I recently spent time debugging a memory leak in a Python web application built with Flask and SQLAlchemy, deployed as a Linux container in Kubernetes. Each deployment of the application lived for about two weeks, with a steady increase in memory that eventually led to application crashes.
We had some asynchronous tasks that were resource-intensive, both in CPU and memory, so our first approach was to use the garbage collector to inspect the number of objects over the lifecycle of the application. To our surprise, we could not see any suspicious increase in instances of any class: all objects seemed to be properly collected and disposed of after each task. The pattern of memory growth was also curious: when we ran a memory-intensive task there was a spike, but after the spike a few dozen megabytes were never returned to the system (which after a few days added up to hundreds).
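The kind of check we ran can be sketched with the standard gc module: count live instances per type before and after a task and look for classes that keep growing. (`run_task` below is a hypothetical stand-in for one of our tasks.)

```python
import gc
from collections import Counter

def count_objects_by_type():
    """Return a Counter mapping type name -> number of live instances."""
    gc.collect()  # force a collection so only truly live objects remain
    return Counter(type(obj).__name__ for obj in gc.get_objects())

before = count_objects_by_type()
# run_task()  # hypothetical workload
after = count_objects_by_type()

# Show the types whose instance count grew the most
growth = {name: after[name] - before[name] for name in after}
for name, delta in sorted(growth.items(), key=lambda kv: -kv[1])[:10]:
    if delta > 0:
        print(f"{name}: +{delta}")
```

In our case this report stayed flat across task runs, which is exactly what told us the problem was not unreleased Python objects.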
At this point we knew that:
- Objects were created and memory was taken from the system.
- The garbage collector disposed of these objects when they were not referenced anymore.
- Memory that had been taken from the system was not completely returned.
But Python is a garbage-collected language used by millions of people daily; did this mean there was a bug in it? That seemed unlikely. So the next step was to delve into something I had never needed to think about: how does Python manage memory?
There are different levels to this topic, but what I was interested in was memory allocation, something I hadn't touched since my university days, when we were taught to code in C.
Short intro to Python memory management
Python differentiates between small objects (up to 512 bytes), like integers, and large objects, like HTTP response bodies or database query results. Small objects are handled by Python's own allocator, PyMalloc. Large objects are handled directly by the system memory allocator, in our case glibc.
PyMalloc organises memory in a strict three-level hierarchy: each 1 MiB arena contains 4 KiB pools, and each pool is divided into fixed-size blocks. When all blocks in an arena are free, that arena is returned to the OS, so a single live small object can pin a full 1 MiB arena in memory.
Arena (1 MiB on 64-bit systems)
└── Pool (4 KiB)
└── Block (8–512 bytes)
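The 512-byte split is easy to see from the interpreter. A minimal sketch (sizes shown are approximate and depend on the Python version):

```python
import sys

small = 12345                 # a plain int: a few dozen bytes
large = b"x" * (1024 * 1024)  # a 1 MiB bytes object

print(sys.getsizeof(small))   # well under 512 bytes -> served by PyMalloc
print(sys.getsizeof(large))   # far over 512 bytes  -> served by the system allocator
```

If you want to look at PyMalloc's arenas and pools directly, CPython exposes `sys._debugmallocstats()`, which dumps allocator statistics to stderr.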
On the other hand, large objects are handled by glibc's malloc, whose internal structure is completely different but whose reclaim logic is similar in spirit: memory is returned to the OS when a region has no live objects. glibc adds one complication, though: when Python runs multiple threads, glibc creates additional arenas per thread, up to 8 × the number of CPU cores on 64-bit systems. Each thread can accumulate its own arenas, and if a small fraction of objects remains live across many of them, none can be reclaimed.
One thing worth noting: Python's GIL prevents true parallel CPU execution across threads, but it doesn't affect how glibc assigns arenas. Arena assignment happens at the malloc call level, so each OS thread still acquires its own arena on first allocation, regardless of the GIL.
So, imagine an application that spawns multiple tasks, each on a different thread, that make a lot of web requests and create a lot of objects. That means many small and large objects, which means many arenas allocated both by PyMalloc and by glibc for each thread. That's a lot of arenas. And if for some reason 99% of these objects are released but 1% remains referenced, scattered across arenas, little memory can actually be released. This issue is called memory fragmentation, and it's common in long-lived applications (not only in Python) that spawn multiple threads.
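The pattern can be illustrated with a toy sketch: several threads each allocate many buffers and release 99% of them, but the surviving 1% is scattered across the threads' allocations, so the allocator may be unable to hand most of the underlying arenas back to the OS. All sizes and counts below are made up for illustration; the script doesn't measure RSS, it only shows the pinning pattern.

```python
import threading

survivors = []          # the 1% of objects that stay referenced
lock = threading.Lock()

def churn(n_buffers=1000, size=16 * 1024):
    local = [bytearray(size) for _ in range(n_buffers)]
    with lock:
        # keep every 100th buffer alive, drop the rest
        survivors.extend(local[::100])
    # `local` goes out of scope here: 99% of the memory is now free,
    # but each kept buffer can pin the arena it lives in

threads = [threading.Thread(target=churn) for _ in range(8)]
for t in threads:
    t.start()
for t in threads:
    t.join()

print(len(survivors))  # 80 buffers still referenced, scattered across arenas
```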
Solution
Before switching allocators, we tried two lighter interventions. First, we set MALLOC_ARENA_MAX=2 to cap the number of glibc arenas across all threads, a common first recommendation for this class of problem. Second, we tried calling malloc_trim(0) periodically to prompt glibc to release free memory at the top of the heap. Neither had a meaningful impact in our case.
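For reference, the periodic malloc_trim(0) call looked roughly like this from Python, via ctypes. malloc_trim is glibc-specific, so the sketch is guarded and simply does nothing on platforms without it (macOS, musl, ...):

```python
import ctypes
import ctypes.util

def try_malloc_trim():
    """Ask glibc to release free heap memory back to the OS.

    Returns glibc's result (1 if some memory was released, 0 if not),
    or None when malloc_trim is unavailable on this platform.
    """
    libc_path = ctypes.util.find_library("c")
    if libc_path is None:
        return None
    try:
        libc = ctypes.CDLL(libc_path)
        return libc.malloc_trim(0)
    except (OSError, AttributeError):
        return None

print(try_malloc_trim())
```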
That led us to replace glibc with jemalloc.
Indeed, glibc is known to produce memory fragmentation in some scenarios - that's one of the reasons why Jason Evans created jemalloc, a memory allocator designed to address memory fragmentation and scalability issues in heavily multi-threaded environments.
And surprisingly (at least to me), it's very easy to replace glibc with jemalloc as the memory allocator used by Python (in a Linux container): just install jemalloc with apt (or your package manager) and point the LD_PRELOAD environment variable at it. For example:
sudo apt update
sudo apt install -y libjemalloc2
export LD_PRELOAD=/usr/lib/x86_64-linux-gnu/libjemalloc.so.2
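One way to confirm the preload actually took effect is to check whether jemalloc appears among the shared objects mapped into the running process. A small sketch (/proc/self/maps is Linux-only, so the live check is guarded):

```python
import os

def jemalloc_in_maps(maps_text):
    """Return True if any mapped file path mentions jemalloc."""
    return any("libjemalloc" in line for line in maps_text.splitlines())

if os.path.exists("/proc/self/maps"):
    with open("/proc/self/maps") as f:
        print("jemalloc loaded:", jemalloc_in_maps(f.read()))
```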
Proper configuration of jemalloc is also needed. Usually, narenas is set to a low number to force threads to share arenas and reduce memory fragmentation, at the possible cost of some performance (negligible in our case). Here is a configuration that might work in the scenario I described:
export MALLOC_CONF="narenas:1,tcache:false"
As with almost everything in software, nothing is free, and glibc is so widely used because it generally works pretty well. With this configuration we are reducing memory fragmentation at the cost of more CPU usage, which in our case was negligible, as it may well be in yours.
So, our fragmentation issue was solved with three new short lines in our Dockerfile. As a final word, the best jemalloc configuration depends heavily on your specific case. There are other variables that can be set, like dirty_decay_ms or background_thread, that may or may not be useful, so always try different configurations and find the one that yields the best behaviour for you!
Bonus
While investigating, I also found out that it is possible to disable PyMalloc entirely with PYTHONMALLOC=malloc. This routes all memory management (small objects included) through the system allocator (glibc or jemalloc, depending on what is configured).
export PYTHONMALLOC=malloc
I'm sharing this as a curiosity: in my case, performance noticeably degraded. PyMalloc is specifically optimised for Python's pattern of many small, short-lived objects, and is estimated to give a 15–20% performance improvement for typical workloads. Bypassing it pushes significantly more allocations into the system allocator, which is slower for that use case.
Unless you have strong evidence that PyMalloc itself is contributing to your problem, I would not recommend this.
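If you want to see the difference yourself, a tiny micro-benchmark churning through small, short-lived objects (the workload PyMalloc is optimised for) can be run once with the default allocator and once with `PYTHONMALLOC=malloc`. The file name `bench.py` and the numbers are illustrative, not a rigorous benchmark:

```python
#   python bench.py
#   PYTHONMALLOC=malloc python bench.py
import time

def churn_small_objects(n=200_000):
    for _ in range(n):
        # small dicts, lists and tuples all fall under the 512-byte threshold
        obj = {"a": [1, 2, 3], "b": (4, 5)}
        del obj

start = time.perf_counter()
churn_small_objects()
elapsed = time.perf_counter() - start
print(f"elapsed: {elapsed:.3f}s")
```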