malloc internals — Why free() Doesn't Return Memory to the System

#linux #c #performance #memory

Why Your Process's RSS Grows and Doesn't Shrink

The classic moment of confusion: a program allocates 2 GB, processes the data, frees everything via free() — and htop still shows 2 GB of resident memory (RSS). A leak? No. It's the designed behavior of the allocator, which most developers mistake for a bug because they don't understand what malloc() and free() actually do under the hood.

In the vast majority of cases, free() does not return memory to the operating system. It returns it to the allocator — the intermediate layer between your code and the kernel that manages memory pools to avoid costly syscalls on every allocation. Understanding this layer is the difference between "we have a memory leak" and "that's arena fragmentation, RSS won't shrink, but it isn't a leak."

This article dissects the internals of malloc in glibc: arenas, bins, chunks, the brk vs mmap boundary, the mechanics of fragmentation, and — crucial for production decisions — when jemalloc or tcmalloc beats the default glibc allocator.

The Intermediate Layer: Why We Don't Call the Kernel Directly

The kernel manages memory at page granularity (usually 4 KB). If every malloc(16) required a syscall, the cost would be absurd: a syscall costs roughly 100–500 ns, and a typical program performs millions of allocations. The allocator solves this by requesting memory from the kernel in bulk and dividing it into small pieces in userspace.

Two syscalls supply memory from the kernel:

Mechanism	Action	Use in glibc malloc
`brk` / `sbrk`	Moves the data segment boundary (program break)	Small allocations (< 128 KB), main arena
`mmap`	Maps anonymous pages anywhere in the address space	Large allocations (≥ 128 KB) and thread arenas

# See both mechanisms in action — strace on a simple program
$ strace -e trace=brk,mmap ./program 2>&1 | head
brk(NULL)                = 0x55a3c2a00000        # current program break
brk(0x55a3c2a21000)      = 0x55a3c2a21000        # arena extension (132 KB)
mmap(NULL, 2101248, PROT_READ|PROT_WRITE, ...)   # large allocation (2 MB)
                         = 0x7f3a8c000000

The key threshold is M_MMAP_THRESHOLD (default 128 KB). In the main arena, allocations below it are carved out of the brk-managed heap; allocations at or above it get their own mapping via mmap. This distinction has fundamental consequences for whether memory ever returns to the system.
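One way to watch the threshold at work from inside a program is glibc's mallinfo2(), whose hblks field counts allocations currently backed by their own mmap region. The sketch below is glibc-specific and assumes the default threshold has not been changed through mallopt() or environment variables; the names mmapped_regions, probe_threshold, and struct threshold_probe are invented for illustration.

```c
#include <assert.h>
#include <malloc.h>   /* glibc-specific: mallinfo2() / mallinfo() */
#include <stddef.h>
#include <stdlib.h>

/* Number of allocations currently served by their own mmap region. */
size_t mmapped_regions(void) {
#if defined(__GLIBC__) && __GLIBC_PREREQ(2, 33)
    struct mallinfo2 mi = mallinfo2();
#else
    struct mallinfo mi = mallinfo();   /* older glibc: same fields, int-sized */
#endif
    return (size_t)mi.hblks;
}

struct threshold_probe {
    size_t baseline, with_small, with_large, after_free;
};

/* Allocates one block below and one above the default 128 KB threshold
   and records the mmap-region count after each step. */
struct threshold_probe probe_threshold(void) {
    struct threshold_probe r;
    r.baseline = mmapped_regions();

    void *small = malloc(1000);            /* below the threshold: heap */
    assert(small != NULL);
    *(volatile char *)small = 1;           /* keep the allocation from being optimized out */
    r.with_small = mmapped_regions();

    void *large = malloc(1024 * 1024);     /* above the threshold: mmap */
    assert(large != NULL);
    *(volatile char *)large = 1;
    r.with_large = mmapped_regions();

    free(large);                           /* munmap: the region disappears */
    r.after_free = mmapped_regions();
    free(small);
    return r;
}
```

Under those assumptions, with_small should equal baseline, with_large should be one higher, and after_free should drop back to baseline.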

Anatomy of a Chunk — How malloc Remembers Sizes

When you call malloc(100), you get a pointer to 100 bytes — but the allocator reserved more. Every memory block is a chunk with a metadata header just before the returned pointer:

/* Simplified chunk structure in glibc (ptmalloc2) */
struct malloc_chunk {
    size_t      prev_size;  /* size of the previous chunk, IF free */
    size_t      size;       /* size of this chunk + 3 flag bits */

    /* The fields below are used ONLY when the chunk is free: */
    struct malloc_chunk *fd;  /* forward pointer in the free list */
    struct malloc_chunk *bk;  /* backward pointer in the free list */
};

/* The three lowest bits of 'size' are flags (chunks are at least 8-byte aligned, 16-byte on 64-bit): */
/* PREV_INUSE     (0x1) — whether the previous chunk is in use     */
/* IS_MMAPPED     (0x2) — whether the chunk comes from mmap         */
/* NON_MAIN_ARENA (0x4) — whether the chunk belongs to a thread arena */

A clever trick: when a chunk is in use, the fd/bk fields aren't needed, so that area holds user data. When the chunk is free, those same bytes store list pointers. This is why a use-after-free that writes to a freed chunk corrupts the allocator's lists: it overwrites fd/bk, and a later allocation can then hand out or write to an attacker-chosen address (the basis of fastbin and tcache poisoning attacks).
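The header can be inspected on a live pointer. The sketch below assumes glibc's ptmalloc2 layout, where the size field occupies the size_t immediately before the address malloc() returns. It deliberately reads memory that the C standard does not let a program touch, so treat it as a debugging curiosity and not a supported API. The flag names follow the comment above; the helper functions are invented for illustration.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/* Flag bits stored in the low bits of the size field (glibc ptmalloc2). */
#define PREV_INUSE     ((size_t)0x1)
#define IS_MMAPPED     ((size_t)0x2)
#define NON_MAIN_ARENA ((size_t)0x4)
#define FLAG_MASK      ((size_t)0x7)

/* Raw size field of the chunk that owns user pointer p.
   Assumption: glibc layout. Not portable, and outside what ISO C permits. */
size_t chunk_size_field(const void *p) {
    assert(p != NULL);
    return ((const volatile size_t *)p)[-1];
}

/* Chunk size in bytes, header included, with the flag bits masked off. */
size_t chunk_size(const void *p) {
    return chunk_size_field(p) & ~FLAG_MASK;
}

/* Only the three flag bits. */
size_t chunk_flags(const void *p) {
    return chunk_size_field(p) & FLAG_MASK;
}
```

On 64-bit glibc, chunk_size(malloc(100)) typically reports 112: the 100 requested bytes plus the 8-byte size field, rounded up to a multiple of 16. A 1 MB allocation should come back with IS_MMAPPED set.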

Bins — The Structure for Managing Free Chunks

Freed chunks aren't immediately returned to the system. They go into bins — lists of free chunks grouped by size, so the next malloc() can quickly find a matching block. glibc maintains several bin categories with different characteristics:

Bin type	Chunk size	Characteristics
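Fast bins	32–128 B by default on 64-bit (10 bins, up to 160 B via M_MXFAST)	LIFO, no coalescing on free, singly linked
Tcache	per-thread, requests up to 1032 B on 64-bit	Per-thread cache (glibc 2.26+), no locking needed
Small bins	< 1024 B on 64-bit, < 512 B on 32-bit (62 bins)	FIFO, exact size, doubly linked
Large bins	≥ 1024 B on 64-bit, ≥ 512 B on 32-bit	Sorted, size ranges, best-fit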
Unsorted bin	any	Intermediate buffer before classification

Tcache (thread-local caching) is the most important optimization of recent years: each thread has its own pool of recently freed chunks, accessible without taking the arena lock. This dramatically speeds up multithreaded allocations, but also introduced new attack vectors (tcache poisoning).
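The LIFO behavior of the tcache and fast bins is easy to observe: a chunk that was just freed is the first one handed back for a request of the same size. The sketch below assumes glibc with default settings and a bin that is not already full; the function names are invented for illustration. Addresses are saved as integers before free() so the code never compares a dangling pointer.

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

/* Allocate one block and write to it through a volatile pointer so the
   compiler cannot optimize the allocation away. */
static void *alloc_touched(size_t size) {
    void *p = malloc(size);
    assert(p != NULL);
    *(volatile char *)p = 1;
    return p;
}

/* Returns 1 if a free() immediately followed by a same-size malloc()
   hands back the same address. */
int freed_chunk_is_reused(size_t size) {
    void *p = alloc_touched(size);
    uintptr_t old_addr = (uintptr_t)p;   /* save the address before freeing */
    free(p);
    void *q = alloc_touched(size);
    int same = ((uintptr_t)q == old_addr);
    free(q);
    return same;
}

/* Returns 1 if two freed same-size chunks come back in reverse (LIFO) order. */
int reuse_is_lifo(size_t size) {
    void *a = alloc_touched(size);
    void *b = alloc_touched(size);
    uintptr_t addr_a = (uintptr_t)a;
    uintptr_t addr_b = (uintptr_t)b;
    free(a);
    free(b);                             /* b is now at the head of the list */
    void *first = alloc_touched(size);
    void *second = alloc_touched(size);
    int lifo = ((uintptr_t)first == addr_b) && ((uintptr_t)second == addr_a);
    free(first);
    free(second);
    return lifo;
}
```

With a 64-byte request both functions should return 1 on glibc. Other allocators are free to behave differently, which is part of why code must never rely on this.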

Why free() Doesn't Return Memory

When you free a chunk, the allocator runs a sequence:

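1. If the chunk fits in the tcache and the matching bin isn't full, it goes there (the fastest path).
2. Otherwise, a fast-bin-sized chunk goes into its fast bin without coalescing.
3. Any other arena chunk is coalesced with adjacent free chunks to counteract fragmentation and placed in the unsorted bin; it reaches a small or large bin later, when malloc() sorts the unsorted bin.
4. Only if the coalesced free area borders the top of the arena (the top chunk) and grows large enough does the allocator shrink the heap, which for the main arena means sbrk with a negative argument. Calling malloc_trim() requests a release explicitly.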

The critical condition: memory returns to the system only when the free area is at the very end of the arena and exceeds the threshold (M_TRIM_THRESHOLD, default 128 KB). If even one live chunk sits between the freed area and the top of the heap, brk can't retreat: the memory stays reserved, though free. This is fragmentation, not a leak.

#include <stdlib.h>
#include <malloc.h>

int main(void) {
    /* Allocate 1000 blocks of 1 KB */
    void *blocks[1000];
    for (int i = 0; i < 1000; i++)
        blocks[i] = malloc(1024);

    /* Free all the EVEN ones — a fragmentation checkerboard */
    for (int i = 0; i < 1000; i += 2)
        free(blocks[i]);

    /* RSS stays high: free chunks interleave with live ones,
       brk can't retreat because live chunks block the arena's end */

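    /* Explicitly ask glibc to return memory to the system */
    malloc_trim(0);   /* trims the top of the heap and, since glibc 2.8, releases
                         whole free pages inside it; the 1 KB holes between live
                         blocks here contain no whole page, so little comes back */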

    return 0;
}
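The checkerboard effect can also be measured instead of inferred from htop. The sketch below uses glibc's mallinfo2() to compare the heap size obtained from the kernel (arena) with the bytes sitting in free chunks (fordblks) before and after freeing every second block. It assumes glibc, a single thread, and default tuning; struct heap_snapshot, snapshot_heap, and checkerboard are names invented for illustration.

```c
#include <assert.h>
#include <malloc.h>   /* glibc-specific: mallinfo2() / mallinfo() */
#include <stddef.h>
#include <stdlib.h>

enum { NBLOCKS = 1000, BLOCK_SIZE = 1024 };

struct heap_snapshot {
    size_t heap_bytes;   /* bytes obtained from the kernel for the brk heap */
    size_t free_bytes;   /* bytes sitting in free chunks inside that heap   */
};

struct heap_snapshot snapshot_heap(void) {
#if defined(__GLIBC__) && __GLIBC_PREREQ(2, 33)
    struct mallinfo2 mi = mallinfo2();
#else
    struct mallinfo mi = mallinfo();   /* older glibc: same fields, int-sized */
#endif
    struct heap_snapshot s = { (size_t)mi.arena, (size_t)mi.fordblks };
    return s;
}

/* Allocates NBLOCKS blocks, frees every even one, and reports the heap
   state before and after the frees. */
void checkerboard(struct heap_snapshot *before, struct heap_snapshot *after) {
    static void *blocks[NBLOCKS];
    for (int i = 0; i < NBLOCKS; i++) {
        blocks[i] = malloc(BLOCK_SIZE);
        assert(blocks[i] != NULL);
    }
    *before = snapshot_heap();
    for (int i = 0; i < NBLOCKS; i += 2)
        free(blocks[i]);
    *after = snapshot_heap();
    for (int i = 1; i < NBLOCKS; i += 2)   /* clean up the live half */
        free(blocks[i]);
}
```

After the frees, free_bytes should have grown by roughly half a megabyte while heap_bytes stays exactly where it was: the memory is free from the allocator's point of view but still owned by the process. A few freed chunks sit in the tcache and are not counted as free, so the increase is slightly below 500 KB.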

mmap vs brk — Why Large Allocations Are Different

Allocations above M_MMAP_THRESHOLD go directly via mmap and have a fundamentally different return profile. Each such allocation is a separate mapping that free() immediately returns to the system via munmap — because it's an independent region, not part of a shared arena.

Property	`brk` arena (small)	`mmap` (large ≥128 KB)
Return to OS on free()	Rarely (only from arena's end)	Immediately (munmap)
Allocation cost	Low (usually no syscall)	High (syscall + page fault)
Fragmentation	Possible (shared arena)	None (isolated mappings)
Initialization	Memory may contain garbage	Always zeroed (kernel guarantees)

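Hence the counterintuitive effect: a program doing many large allocations may have more stable RSS than one with a million small ones, because the large ones return to the system immediately while the small ones stay trapped in the arena. The threshold is also dynamic: when the program frees an mmap-backed block larger than the current threshold, glibc raises the threshold to that block's size (up to 32 MB on 64-bit), so later allocations of that size come from the arena and avoid costly mmap/munmap cycles. Setting M_MMAP_THRESHOLD explicitly disables this adjustment.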
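The immediate return of large allocations shows up directly in RSS. The sketch below is Linux-only: it reads the resident page count from /proc/self/statm, touches every page of a large buffer, and reads RSS again after free(). The helper names are invented for illustration, and the exact numbers depend on the kernel and on what else the process is doing.

```c
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>
#include <unistd.h>

/* Resident set size in bytes, read from /proc/self/statm (Linux only).
   Returns -1 if the file cannot be read. */
long rss_bytes(void) {
    long total_pages, resident_pages;
    FILE *f = fopen("/proc/self/statm", "r");
    if (f == NULL)
        return -1;
    int fields = fscanf(f, "%ld %ld", &total_pages, &resident_pages);
    fclose(f);
    if (fields != 2)
        return -1;
    return resident_pages * sysconf(_SC_PAGESIZE);
}

struct rss_trace {
    long before, touched, after_free;
};

/* Allocates `size` bytes, touches every page, frees the buffer,
   and records RSS at each step. */
struct rss_trace large_alloc_round_trip(size_t size) {
    struct rss_trace t;
    long page = sysconf(_SC_PAGESIZE);
    t.before = rss_bytes();

    volatile char *buf = malloc(size);
    assert(buf != NULL);
    for (size_t i = 0; i < size; i += (size_t)page)
        buf[i] = 1;               /* fault each page in so it counts as resident */
    t.touched = rss_bytes();

    free((void *)buf);            /* mmap-backed chunk: glibc calls munmap here */
    t.after_free = rss_bytes();
    return t;
}
```

Calling large_alloc_round_trip(64UL * 1024 * 1024) should show RSS rising by about 64 MB and falling back by about the same amount. The 64 MB size is chosen to stay above the 32 MB ceiling of the dynamic threshold on 64-bit glibc, so the buffer is mmap-backed no matter what the program freed earlier.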

Arenas — Multithreaded Scaling

In a multithreaded program, if all threads competed for a single arena, the lock on it would be a bottleneck. glibc solves this with multiple arenas: the main arena (on brk) plus additional thread arenas (on mmap), each with its own lock.

# The number of arenas is capped — default 8 × core count (64-bit)
$ echo $((8 * $(nproc)))
64

# Control via an environment variable
$ MALLOC_ARENA_MAX=2 ./program
# Limiting arenas reduces memory usage (fewer separate pools)
# at the cost of potential lock contention with many threads

# Diagnostics of all arenas' state: have the program call malloc_stats()
# before it exits; glibc prints per-arena totals to stderr
$ ./program 2>&1 | tail -20

The trade-off is direct: more arenas = less lock contention, but more fragmented, unreturned memory (each arena holds its own free chunks). For multithreaded, RSS-sensitive applications, MALLOC_ARENA_MAX=2 is a common first optimization shot. This also explains why the same program uses more memory on a 64-core machine than a 4-core one.
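The same cap can be set from code when changing the environment is not an option. This is a minimal sketch using glibc's mallopt() with M_ARENA_MAX; the wrapper name cap_arenas is invented for illustration, and the call is assumed to happen early in main(), before any threads are created.

```c
#include <assert.h>
#include <malloc.h>   /* glibc-specific: mallopt(), M_ARENA_MAX */

/* Caps the number of arenas glibc may create.
   Returns 1 on success and 0 on error, as mallopt() does. */
int cap_arenas(int max_arenas) {
    assert(max_arenas > 0);
    return mallopt(M_ARENA_MAX, max_arenas);
}
```

Calling cap_arenas(2) at startup is meant to mirror MALLOC_ARENA_MAX=2. Whether it helps is workload-dependent, so compare RSS and throughput before and after.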

Alternative Allocators — Measurement, Not Ideology

The default ptmalloc2 in glibc is a general-purpose compromise. For specific workloads, specialized allocators offer a measurable advantage.

Allocator	Strength	Typical use
ptmalloc2 (glibc)	Universal, always available	Default, most applications
jemalloc	Low fragmentation, predictable RSS, profiling	Databases, long-running servers (Redis, Facebook)
tcmalloc (Google)	Very fast small allocations, great multithreaded	Google apps, allocation-intensive
mimalloc (Microsoft)	Newest, great speed/fragmentation balance	New projects; Koka and Lean runtimes

# Swap the allocator WITHOUT recompilation — via LD_PRELOAD
$ LD_PRELOAD=/usr/lib/libjemalloc.so.2 ./server
# Same binary, different allocator — instant benchmark

# Compare RSS under load (same workload, one run per allocator)
$ /usr/bin/time -v ./server 2>&1 | grep "Maximum resident"
    Maximum resident set size (kbytes): 524288   # glibc
$ LD_PRELOAD=/usr/lib/libjemalloc.so.2 /usr/bin/time -v ./server 2>&1 | grep "Maximum resident"
    Maximum resident set size (kbytes): 312456   # jemalloc, ~40% less

# Permanent swap: link at compile time
$ gcc program.c -ljemalloc -o program

The swap works because LD_PRELOAD makes the dynamic linker resolve malloc/free to the preloaded library before libc. This is the same symbol-interposition mechanism that AddressSanitizer's runtime relies on to intercept allocations. Practical rule: jemalloc when you're fighting fragmentation and RSS on a long-running server, tcmalloc when the bottleneck is small-allocation throughput across many threads. Always measure first, however: for many applications the default glibc allocator is sufficient, and swapping is premature optimization.

Diagnostics — Tools

Tool	Use
`strace -e trace=brk,mmap,munmap`	Which memory syscalls the program actually makes
`malloc_stats()`	glibc per-arena statistics, printed to stderr when the program calls it
`malloc_info()`	Programmatic dump of allocator state (XML)
`valgrind --tool=massif`	Heap usage profile over time (heap profiler)
`jemalloc + jeprof`	Allocation profiling with function attribution
`/proc/<pid>/smaps`	Detailed RSS breakdown by mapping (heap, mmap, anon)
`cat /proc/<pid>/status`	VmRSS vs VmData — resident vs reserved
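The two in-process entries from the table can be wired into a program with a few lines. This is a sketch assuming glibc; the wrapper names are invented for illustration, and where the calls are triggered (a signal handler, an admin endpoint, just before exit) is left to the application.

```c
#include <assert.h>
#include <malloc.h>   /* glibc-specific: malloc_info(), malloc_stats() */
#include <stdio.h>

/* Writes the allocator's XML state report to `out`.
   Returns 0 on success, as malloc_info() does. */
int dump_allocator_state(FILE *out) {
    assert(out != NULL);
    return malloc_info(0, out);   /* the options argument must be 0 */
}

/* Prints a short per-arena summary to stderr. */
void print_arena_stats(void) {
    malloc_stats();
}
```

dump_allocator_state(stderr) emits one XML heap element per arena, which makes it straightforward to diff two dumps taken at different moments of a long-running process.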

Conclusion: The Allocator as a Layer You Must Understand

malloc and free aren't thin wrappers over the kernel — they're a sophisticated memory management layer with arenas, bins, chunks, and anti-fragmentation strategies. An engineer who understands this layer doesn't panic at high RSS after free(), knows when it's fragmentation versus a real leak, and can deliberately match the allocator to the workload's profile.

The key understandings: free() returns memory to the allocator, not the system; the brk vs mmap boundary at 128 KB determines whether memory ever returns; tcache and multiple arenas speed up multithreading at the cost of RSS; and chunk metadata interleaves with data, which makes use-after-free a corruption vector. This knowledge turns "mysterious memory growth" into a predictable, measurable, and controllable aspect of the system.

RSS that doesn't drop after free() is most often not a bug in your code — it's the allocator doing exactly what it was designed to do. The question isn't "why didn't it free," but "is this fragmentation, or am I actually holding live references."