Most developers use malloc without thinking much about what happens underneath.
This project is an attempt to explore that layer by building a memory allocator from scratch in C.
The allocator implements malloc, free, calloc, and realloc without relying on libc’s heap functions. It focuses on:
- Thread safety
- Per-thread caching (tcache)
- Efficient free block management using bins
- mmap-based memory growth
- Handling large allocations separately
This article breaks down the design, implementation decisions, performance characteristics, and limitations of the allocator.
What is a Memory Allocator?
A memory allocator is responsible for managing dynamic memory at runtime.
Functions like malloc, free, calloc, and realloc are part of this layer.
At a high level, an allocator:
- Requests memory from the operating system (e.g., using mmap)
- Splits that memory into smaller blocks
- Tracks which blocks are free or in use
- Reuses freed blocks to avoid unnecessary system calls
This layer sits between user programs and the OS, making memory allocation efficient and reusable.
Why Allocators Are Non-Trivial
A good allocator must balance multiple competing goals:
- Performance → allocations should be fast
- Memory efficiency → minimize fragmentation
- Scalability → handle multi-threaded workloads
- System overhead → reduce expensive syscalls
Modern allocators like those in libc (e.g., ptmalloc) are highly optimized and use techniques such as arenas, bins, and thread-local caching.
This project implements a simplified version of those ideas to understand how they work in practice.
Allocator Overview
At a high level, the allocator follows two distinct paths based on allocation size:
- Small allocations (< 128KB) → handled through the heap, bins, and per-thread cache
- Large allocations (≥ 128KB) → handled using mmap
Allocation Flow
- mymalloc(size) is called
- Size is aligned to 8 bytes
- If size < 128KB:
  - Check per-thread tcache
  - If hit → return immediately (no locking)
  - If miss → acquire global heap lock
  - Search free bins
  - If not found → request space via an mmap'd chunk
  - Split the block if necessary
- If size ≥ 128KB:
  - Try the large block cache
  - Otherwise call mmap
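The first two steps of this flow, alignment and the size-based dispatch, can be sketched as small helpers. The names `align8` and `is_large` are illustrative, not the project's actual identifiers; only the 8-byte alignment and 128KB threshold come from the design above:

```c
// Sketch of request alignment and the small/large dispatch.
// Helper names are hypothetical; the constants match the post.
#include <stddef.h>

#define LARGE_THRESHOLD (128u * 1024u)  // 128KB boundary from the design

// Round a request up to the allocator's 8-byte alignment.
static size_t align8(size_t sz) {
    return (sz + 7u) & ~(size_t)7u;
}

// Decide which path an aligned request takes.
static int is_large(size_t sz) {
    return align8(sz) >= LARGE_THRESHOLD;
}
```

Masking with `~7` is the standard trick here: adding 7 then clearing the low three bits rounds any size up to the next multiple of 8 without a branch.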
Free Flow
- If the block belongs to the heap:
  - Push to thread-local tcache
  - If tcache is full → flush to global bins + coalesce
- If the block is mmap'd:
  - Store in large cache or release via munmap
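The free-path dispatch can be sketched as below. The two leaf functions are stubs standing in for the real tcache and large-cache logic, and the trimmed-down header only carries the fields this path touches; names follow the diagram later in the post:

```c
// Sketch of the free-path dispatch: heap blocks go to the tcache,
// mmap'd blocks go to the large cache. Leaf functions are stubs.
#include <stddef.h>

typedef struct block_header {
    size_t size;
    int isfree;
    int ismmapped;
} block_header_t;

static int heap_frees, large_frees;  // counters so the stubs are observable

static void tcache_push_or_flush(block_header_t *b)  { (void)b; heap_frees++; }
static void large_cache_or_munmap(block_header_t *b) { (void)b; large_frees++; }

static void myfree(void *ptr) {
    if (!ptr) return;
    block_header_t *b = (block_header_t *)ptr - 1;  // step back to the header
    b->isfree = 1;
    if (b->ismmapped)
        large_cache_or_munmap(b);  // large path: cache or munmap
    else
        tcache_push_or_flush(b);   // heap path: tcache, then bins on flush
}
```

The `(block_header_t *)ptr - 1` step is the inverse of malloc returning the address just past the header; it is why passing a pointer that did not come from the allocator corrupts the heap.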
Each allocation is preceded by a metadata header:
[ block_header | user_data ]
The header stores:
- size
- allocation state
- mmap flag
- pointers for heap and bin lists
┌──────────────────────────────┐
│ block_header_t │ 48 bytes
│ size_t size │
│ int isfree │
│ int ismmapped │
│ block_header_t *next/prev │ heap linked list
│ block_header_t *bin_next │ bin free list
│ block_header_t *bin_prev │ bin free list
├──────────────────────────────┤
│ user data │ size bytes ← returned pointer
└──────────────────────────────┘
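Written out as a C struct, the diagram above corresponds to the following layout (field names are taken from the diagram; the 48-byte total assumes x86-64, where `size_t` and pointers are 8 bytes):

```c
// The header layout from the diagram as a C struct. On x86-64 this is
// 48 bytes: 8 (size) + 4 + 4 (flags, no padding) + 4 pointers of 8 bytes.
#include <stddef.h>

typedef struct block_header {
    size_t size;                    // usable size of the block
    int isfree;                     // allocation state
    int ismmapped;                  // mmap flag for large blocks
    struct block_header *next;      // heap linked list
    struct block_header *prev;      // heap linked list
    struct block_header *bin_next;  // bin free list
    struct block_header *bin_prev;  // bin free list
} block_header_t;
```

The pointer returned to the user is simply `(void *)(header + 1)`, the first byte after the 48-byte header.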
Per-Thread Cache (tcache)
Each thread maintains its own cache of free blocks, giving it private temporary storage from which freed blocks can be reused without touching shared state.
Benefits:
- No locking on fast path
- High cache locality
- Significant performance boost in multi-threaded workloads
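A minimal tcache can be sketched as a thread-local stack of freed blocks. The capacity and names here are illustrative (real tcaches, including the one in glibc, are usually segregated by size class), but the key property is visible: because the storage is `_Thread_local`, neither push nor pop takes a lock:

```c
// Minimal sketch of a per-thread cache: a bounded LIFO of freed blocks.
// Capacity and names are illustrative, not the project's actual values.
#include <stddef.h>

#define TCACHE_CAP 16

static _Thread_local void *tcache[TCACHE_CAP];  // one copy per thread
static _Thread_local int   tcache_count;

static void tcache_push(void *block) {
    if (tcache_count < TCACHE_CAP)
        tcache[tcache_count++] = block;  // cached; a full cache would flush
}

static void *tcache_pop(void) {
    return tcache_count ? tcache[--tcache_count] : NULL;  // NULL = miss
}
```

LIFO order is deliberate: the most recently freed block is the one most likely to still be warm in the CPU cache, which is where the cache-locality benefit comes from.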
Free Block Bins
Free blocks are organized into 8 size-based bins.
This enables:
- O(1) size class lookup
- Reduced search overhead
- Better reuse of memory blocks
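The lookup can be sketched as mapping a size to one of 8 doubling size classes. The boundaries below are illustrative, since the post does not list the project's exact ranges; the loop is bounded by the constant bin count, so the lookup cost is constant:

```c
// Sketch of a constant-bounded size-class lookup over 8 bins.
// Boundaries are illustrative: bin 0 holds blocks up to 32 bytes,
// bin 1 up to 64, ..., bin 7 holds everything larger.
#include <stddef.h>

#define NBINS 8

static int bin_index(size_t size) {
    size_t limit = 32;
    for (int i = 0; i < NBINS - 1; i++) {
        if (size <= limit) return i;
        limit *= 2;  // doubling size classes
    }
    return NBINS - 1;  // catch-all bin for large heap blocks
}
```

On free, a block is threaded onto the list at `bin_index(size)`; on allocation, the search starts at that index and walks upward, which is what keeps the search overhead low.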
Benchmarks
Tests were run on x86-64 Linux with 8 threads and -O2.
| Test | Custom | libc | Result |
|---|---|---|---|
| Single alloc/free | 58ms | 29ms | 2x slower |
| Batch alloc | 1.44ms | 3.59ms | 2.5x faster |
| Batch free | 0.36ms | 1.54ms | 4x faster |
| Mixed sizes | 6.46ms | 2.95ms | 2x slower |
| Realloc chain | 6.42ms | 2.56ms | 2.5x slower |
| Multithreaded | 64ms | 67ms | Comparable |
Observations
- Batch workloads benefit heavily from tcache
- Large allocation cache reduces mmap calls significantly
- Global lock limits scalability
- libc remains more optimized for general workloads
Next, I plan to implement per-thread arenas to eliminate global lock contention.
Limitations
- Single global lock limits scalability
- No in-place realloc
- No coalescing inside tcache
- Large mmap blocks may waste memory
- No cleanup on thread exit
Source Code
You can explore the full source here.
Final Thoughts
Building a memory allocator from scratch highlights the trade-offs between performance, complexity, and correctness.
Even a simplified allocator quickly grows in complexity when thread safety, caching, and fragmentation are considered.
If you have suggestions, optimizations, or questions, feel free to ask or start a discussion.