Most developers use malloc without thinking much about what happens underneath.
This project is an attempt to explore that layer by building a memory allocator from scratch in C.
The allocator implements malloc, free, calloc, and realloc without relying on libc’s heap functions. It focuses on:
- Thread safety
- Per-thread caching (tcache)
- Efficient free block management using bins
- mmap-based memory growth
- Handling large allocations separately
This article breaks down the design, implementation decisions, performance characteristics, and limitations of the allocator.
What is a Memory Allocator?
A memory allocator is responsible for managing dynamic memory at runtime.
Functions like malloc, free, calloc, and realloc are part of this layer.
At a high level, an allocator:
- Requests memory from the operating system (e.g., using mmap)
- Splits that memory into smaller blocks
- Tracks which blocks are free or in use
- Reuses freed blocks to avoid unnecessary system calls
This layer sits between user programs and the OS, making memory allocation efficient and reusable.
Why Allocators Are Non-Trivial
A good allocator must balance multiple competing goals:
- Performance → allocations should be fast
- Memory efficiency → minimize fragmentation
- Scalability → handle multi-threaded workloads
- System overhead → reduce expensive syscalls
Modern allocators like those in libc (e.g., ptmalloc) are highly optimized and use techniques such as arenas, bins, and thread-local caching.
This project implements a simplified version of those ideas to understand how they work in practice.
Allocator Overview
At a high level, the allocator follows two distinct paths based on allocation size:
- Small allocations (< 128KB) → handled through the heap, bins, and per-thread cache
- Large allocations (≥ 128KB) → handled using mmap
Allocation Flow
- mymalloc(size) is called
- Size is aligned to 8 bytes
- If size < 128KB:
  - Check per-thread tcache
  - If hit → return immediately (no locking)
  - If miss → acquire global heap lock
  - Search free bins
  - If not found → request space via an mmap'd chunk
  - Split the block if necessary
- If size ≥ 128KB:
  - Try the large block cache
  - Otherwise call mmap
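The first two steps of this flow, alignment and the size-based dispatch, can be sketched as small helpers. The names `align8` and `is_large` are illustrative, not the project's actual identifiers; only the 8-byte alignment and 128KB threshold come from the design above:

```c
// Sketch of request alignment and the small/large dispatch.
// Helper names are hypothetical; the constants match the post.
#include <stddef.h>

#define LARGE_THRESHOLD (128u * 1024u)  // 128KB boundary from the design

// Round a request up to the allocator's 8-byte alignment.
static size_t align8(size_t sz) {
    return (sz + 7u) & ~(size_t)7u;
}

// Decide which path an aligned request takes.
static int is_large(size_t sz) {
    return align8(sz) >= LARGE_THRESHOLD;
}
```

Masking with `~7` is the standard trick here: adding 7 then clearing the low three bits rounds any size up to the next multiple of 8 without a branch.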
Free Flow
- If the block belongs to the heap:
  - Push to thread-local tcache
  - If tcache is full → flush to global bins + coalesce
- If the block is mmap'd:
  - Store in large cache or release via munmap
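The free-path dispatch can be sketched as below. The two leaf functions are stubs standing in for the real tcache and large-cache logic, and the trimmed-down header only carries the fields this path touches; names follow the diagram later in the post:

```c
// Sketch of the free-path dispatch: heap blocks go to the tcache,
// mmap'd blocks go to the large cache. Leaf functions are stubs.
#include <stddef.h>

typedef struct block_header {
    size_t size;
    int isfree;
    int ismmapped;
} block_header_t;

static int heap_frees, large_frees;  // counters so the stubs are observable

static void tcache_push_or_flush(block_header_t *b)  { (void)b; heap_frees++; }
static void large_cache_or_munmap(block_header_t *b) { (void)b; large_frees++; }

static void myfree(void *ptr) {
    if (!ptr) return;
    block_header_t *b = (block_header_t *)ptr - 1;  // step back to the header
    b->isfree = 1;
    if (b->ismmapped)
        large_cache_or_munmap(b);  // large path: cache or munmap
    else
        tcache_push_or_flush(b);   // heap path: tcache, then bins on flush
}
```

The `(block_header_t *)ptr - 1` step is the inverse of malloc returning the address just past the header; it is why passing a pointer that did not come from the allocator corrupts the heap.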
Each allocation is preceded by a metadata header:
[ block_header | user_data ]
The header stores:
- size
- allocation state
- mmap flag
- pointers for heap and bin lists
┌──────────────────────────────┐
│ block_header_t │ 48 bytes
│ size_t size │
│ int isfree │
│ int ismmapped │
│ block_header_t *next/prev │ heap linked list
│ block_header_t *bin_next │ bin free list
│ block_header_t *bin_prev │ bin free list
├──────────────────────────────┤
│ user data │ size bytes ← returned pointer
└──────────────────────────────┘
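Written out as a C struct, the diagram above corresponds to the following layout (field names are taken from the diagram; the 48-byte total assumes x86-64, where `size_t` and pointers are 8 bytes):

```c
// The header layout from the diagram as a C struct. On x86-64 this is
// 48 bytes: 8 (size) + 4 + 4 (flags, no padding) + 4 pointers of 8 bytes.
#include <stddef.h>

typedef struct block_header {
    size_t size;                    // usable size of the block
    int isfree;                     // allocation state
    int ismmapped;                  // mmap flag for large blocks
    struct block_header *next;      // heap linked list
    struct block_header *prev;      // heap linked list
    struct block_header *bin_next;  // bin free list
    struct block_header *bin_prev;  // bin free list
} block_header_t;
```

The pointer returned to the user is simply `(void *)(header + 1)`, the first byte after the 48-byte header.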
Per-Thread Cache (tcache)
Each thread maintains its own cache of free blocks, giving it private temporary storage from which freed blocks can be reused without touching shared state.
Benefits:
- No locking on fast path
- High cache locality
- Significant performance boost in multi-threaded workloads
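A minimal tcache can be sketched as a thread-local stack of freed blocks. The capacity and names here are illustrative (real tcaches, including the one in glibc, are usually segregated by size class), but the key property is visible: because the storage is `_Thread_local`, neither push nor pop takes a lock:

```c
// Minimal sketch of a per-thread cache: a bounded LIFO of freed blocks.
// Capacity and names are illustrative, not the project's actual values.
#include <stddef.h>

#define TCACHE_CAP 16

static _Thread_local void *tcache[TCACHE_CAP];  // one copy per thread
static _Thread_local int   tcache_count;

static void tcache_push(void *block) {
    if (tcache_count < TCACHE_CAP)
        tcache[tcache_count++] = block;  // cached; a full cache would flush
}

static void *tcache_pop(void) {
    return tcache_count ? tcache[--tcache_count] : NULL;  // NULL = miss
}
```

LIFO order is deliberate: the most recently freed block is the one most likely to still be warm in the CPU cache, which is where the cache-locality benefit comes from.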
Free Block Bins
Free blocks are organized into 8 size-based bins.
This enables:
- O(1) size class lookup
- Reduced search overhead
- Better reuse of memory blocks
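The lookup can be sketched as mapping a size to one of 8 doubling size classes. The boundaries below are illustrative, since the post does not list the project's exact ranges; the loop is bounded by the constant bin count, so the lookup cost is constant:

```c
// Sketch of a constant-bounded size-class lookup over 8 bins.
// Boundaries are illustrative: bin 0 holds blocks up to 32 bytes,
// bin 1 up to 64, ..., bin 7 holds everything larger.
#include <stddef.h>

#define NBINS 8

static int bin_index(size_t size) {
    size_t limit = 32;
    for (int i = 0; i < NBINS - 1; i++) {
        if (size <= limit) return i;
        limit *= 2;  // doubling size classes
    }
    return NBINS - 1;  // catch-all bin for large heap blocks
}
```

On free, a block is threaded onto the list at `bin_index(size)`; on allocation, the search starts at that index and walks upward, which is what keeps the search overhead low.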
Benchmarks
Tests were run on x86-64 Linux with 8 threads and -O2.
| Test | Custom | libc | Result |
|---|---|---|---|
| Single alloc/free | 58ms | 29ms | 2x slower |
| Batch alloc | 1.44ms | 3.59ms | 2.5x faster |
| Batch free | 0.36ms | 1.54ms | 4x faster |
| Mixed sizes | 6.46ms | 2.95ms | 2x slower |
| Realloc chain | 6.42ms | 2.56ms | 2.5x slower |
| Multithreaded | 64ms | 67ms | Comparable |
Observations
- Batch workloads benefit heavily from tcache
- Large allocation cache reduces mmap calls significantly
- Global lock limits scalability
- libc remains more optimized for general workloads
Next, I plan to implement per-thread arenas to eliminate global lock contention.
Limitations
- Single global lock limits scalability
- No in-place realloc
- No coalescing inside tcache
- Large mmap blocks may waste memory
- No cleanup on thread exit
Source Code
You can explore the full source here.
Final Thoughts
Building a memory allocator from scratch highlights the trade-offs between performance, complexity, and correctness.
Even a simplified allocator quickly grows in complexity when thread safety, caching, and fragmentation are considered.
If you have suggestions, optimizations, or questions, feel free to ask or start a discussion.