DEV Community

Prajwal zore
Inside My Custom malloc: Bins, tcache, mmap, and Thread Safety

Most developers use malloc without thinking much about what happens underneath.

This project is an attempt to explore that layer by building a memory allocator from scratch in C.

The allocator implements malloc, free, calloc, and realloc without relying on libc’s heap functions. It focuses on:

  • Thread safety
  • Per-thread caching (tcache)
  • Efficient free block management using bins
  • mmap-based memory growth
  • Handling large allocations separately

This article breaks down the design, implementation decisions, performance characteristics, and limitations of the allocator.


What is a Memory Allocator?

A memory allocator is responsible for managing dynamic memory at runtime.

Functions like malloc, free, calloc, and realloc are part of this layer.

At a high level, an allocator:

  • Requests memory from the operating system (e.g., using mmap)
  • Splits that memory into smaller blocks
  • Tracks which blocks are free or in use
  • Reuses freed blocks to avoid unnecessary system calls

This layer sits between user programs and the OS, making memory allocation efficient and reusable.


Why Allocators Are Non-Trivial

A good allocator must balance multiple competing goals:

  • Performance → allocations should be fast
  • Memory efficiency → minimize fragmentation
  • Scalability → handle multi-threaded workloads
  • System overhead → reduce expensive syscalls

Modern allocators like those in libc (e.g., ptmalloc) are highly optimized and use techniques such as arenas, bins, and thread-local caching.

This project implements a simplified version of those ideas to understand how they work in practice.


Allocator Overview

At a high level, the allocator follows two distinct paths based on allocation size:

  • Small allocations (< 128KB) → handled through heap, bins, and per-thread cache
  • Large allocations (≥ 128KB) → handled using mmap

Allocation Flow


  1. mymalloc(size) is called
  2. Size is aligned to 8 bytes
  3. If size < 128KB:
    • Check per-thread tcache
    • If hit → return immediately (no locking)
    • If miss → acquire global heap lock
      • Search free bins
      • If not found → request a new chunk from the OS via mmap
      • Split block if necessary
  4. If size ≥ 128KB:
    • Try large block cache
    • Otherwise call mmap

Free Flow

  1. If block belongs to heap:
    • Push to thread-local tcache
    • If tcache is full → flush to global bins + coalesce
  2. If block is mmap’d:
    • Store in large cache or release via munmap

Each allocation is preceded by a metadata header:

[ block_header | user_data ]

The header stores:

  • size
  • allocation state
  • mmap flag
  • pointers for heap and bin lists
┌──────────────────────────────┐
│ block_header_t               │  48 bytes
│  size_t size                 │
│  int isfree                  │
│  int ismmapped               │
│  block_header_t *next/prev   │  heap linked list
│  block_header_t *bin_next    │  bin free list
│  block_header_t *bin_prev    │  bin free list
├──────────────────────────────┤
│ user data                    │  size bytes  ← returned pointer
└──────────────────────────────┘


Per-Thread Cache (tcache)

Each thread maintains its own cache of free blocks: a private temporary store it can push freed blocks into and pull from on demand, without coordinating with other threads.

Benefits:

  • No locking on fast path
  • High cache locality
  • Significant performance boost in multi-threaded workloads



Free Block Bins

Free blocks are organized into 8 size-based bins.

This enables:

  • O(1) size class lookup
  • Reduced search overhead
  • Better reuse of memory blocks

Benchmarks

Tests were run on x86-64 Linux with 8 threads and -O2.

Test               Custom    libc      Result
Single alloc/free  58ms      29ms      2x slower
Batch alloc        1.44ms    3.59ms    2.5x faster
Batch free         0.36ms    1.54ms    4x faster
Mixed sizes        6.46ms    2.95ms    2x slower
Realloc chain      6.42ms    2.56ms    2.5x slower
Multithreaded      64ms      67ms      Comparable

Observations

  • Batch workloads benefit heavily from tcache
  • Large allocation cache reduces mmap calls significantly
  • Global lock limits scalability
  • libc remains more optimized for general workloads

Next, I plan to implement per-thread arenas to eliminate global lock contention.

Limitations

  • Single global lock limits scalability
  • No in-place realloc
  • No coalescing inside tcache
  • Large mmap blocks may waste memory
  • No cleanup on thread exit

Source Code

You can explore the full source here.


Final Thoughts

Building a memory allocator from scratch highlights the trade-offs between performance, complexity, and correctness.

Even a simplified allocator quickly grows in complexity when thread safety, caching, and fragmentation are considered.

If you have suggestions, optimizations, or questions, feel free to ask or start a discussion.
