Prajwal zore
I Built malloc() from Scratch in C — Here’s What Went Wrong

Most of us use malloc() without thinking about what happens underneath.

I decided to implement my own memory allocator in C to understand it better. This wasn’t for production use, just to learn how allocation, fragmentation, and concurrency actually behave in practice.

I also benchmarked it against glibc’s malloc to see where it stands.


Implementation Overview

My allocator currently includes:

  • Thread-local cache
  • Free lists (bins) for different size ranges
  • Direct mmap for larger allocations
  • A custom realloc() implementation
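To make those pieces concrete, here is a hypothetical sketch of the data structures involved (the names, bin count, and threshold are my illustration here, not the actual implementation):

```c
#include <stddef.h>

// Illustrative layout: each block carries a small header, small blocks go
// into segregated free lists (bins), and anything above MMAP_THRESHOLD is
// served directly by mmap(). 128 KiB mirrors glibc's default threshold.
#define NUM_BINS       8
#define MMAP_THRESHOLD (128 * 1024)

typedef struct block_header {
    size_t size;                  // usable size of the block
    int    from_mmap;             // 1 if mmap-backed, 0 if from a bin
    struct block_header *next;    // next free block in the same bin
} block_header_t;

// Per-thread cache: one free list per size class; no locking is needed
// as long as a thread only touches its own lists.
typedef struct {
    block_header_t *bins[NUM_BINS];
} thread_cache_t;

static _Thread_local thread_cache_t tcache;

// Map a request size to a bin index (coarse power-of-two classes,
// starting at 32 bytes).
static int bin_index(size_t size) {
    int i = 0;
    size_t class_size = 32;
    while (class_size < size && i < NUM_BINS - 1) {
        class_size <<= 1;
        i++;
    }
    return i;
}
```

The coarse `bin_index` mapping here is exactly the kind of design that causes the mixed-size problems described later.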

Benchmark Results

glibc malloc

Single-threaded:

  • alloc + free (1M iterations): ~26 ms
  • batch alloc/free: 3.40 ms / 0.95 ms
  • mixed sizes: ~2.5 ms

Multi-threaded:

  • 8 threads: ~57 ms

My Allocator

Single-threaded:

  • alloc + free (1M iterations): ~83 ms
  • batch alloc/free: 1.46 ms / 0.50 ms (faster than glibc)
  • mixed sizes: ~126 ms

Multi-threaded:

  • 8 threads: ~791 ms

What Worked

Batch allocation and free operations were faster than glibc's.

This likely comes from:

  • simpler logic in the fast path
  • low per-operation overhead

So in very controlled scenarios, a simple allocator can outperform a general-purpose one.


Where It Struggled

Mixed Allocation Sizes

Performance dropped heavily when handling mixed sizes.

The main issue was my bin design:

  • limited number of bins
  • coarse grouping of sizes

This led to:

  • poor fit for requested sizes
  • more fragmentation
  • additional overhead during allocation

glibc avoids this with finer-grained size classes.
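The difference is easy to quantify. Assuming a coarse power-of-two scheme like mine versus evenly spaced classes (16-byte spacing, which is the spirit of glibc's small bins, simplified here):

```c
#include <stddef.h>

// A coarse power-of-two scheme wastes a lot inside each class: a 130-byte
// request lands in a 256-byte class, wasting 126 bytes. Finer, evenly
// spaced classes cap the internal waste at (spacing - 1) bytes.
static size_t coarse_class(size_t size) {       // power-of-two rounding
    size_t c = 32;
    while (c < size) c <<= 1;
    return c;
}

static size_t fine_class(size_t size) {         // round up to 16-byte spacing
    return (size + 15) & ~(size_t)15;
}
```

For a 130-byte request, `coarse_class` returns 256 (126 bytes of internal waste, nearly 50%) while `fine_class` returns 144 (14 bytes of waste). That gap, multiplied across a mixed-size workload, is where much of my slowdown and fragmentation came from.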


Multithreading

This was the biggest weakness.

Even with thread-local caches, I ran into issues:

  • shared access to heap structures
  • contention when falling back to global data

I tried:

  • global locks
  • per-bin locks

Both added complexity and made debugging harder.
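For reference, the per-bin variant looked roughly like this (names and structure are my sketch, not the real code). One mutex per bin narrows the critical section compared to a single global lock, but every fall-through to the shared bins still serializes somewhere:

```c
#include <pthread.h>
#include <stddef.h>

#define NUM_BINS 8

typedef struct free_node { struct free_node *next; } free_node_t;

static free_node_t    *global_bins[NUM_BINS];
static pthread_mutex_t bin_locks[NUM_BINS] = {
    [0 ... NUM_BINS - 1] = PTHREAD_MUTEX_INITIALIZER   // range init: GCC/Clang extension
};

// Push a freed block onto the shared bin for its size class.
static void global_bin_push(int bin, free_node_t *node) {
    pthread_mutex_lock(&bin_locks[bin]);
    node->next = global_bins[bin];
    global_bins[bin] = node;
    pthread_mutex_unlock(&bin_locks[bin]);
}

// Pop a block, or NULL if the bin is empty (caller then falls back
// to requesting fresh memory from the OS).
static free_node_t *global_bin_pop(int bin) {
    pthread_mutex_lock(&bin_locks[bin]);
    free_node_t *node = global_bins[bin];
    if (node) global_bins[bin] = node->next;
    pthread_mutex_unlock(&bin_locks[bin]);
    return node;
}
```

Even with this split, threads allocating in the same size class contend on the same lock, which is consistent with the 8-thread numbers above.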


realloc() Bug

The most difficult issue I faced was in realloc().

My initial version made a mistake:

  • it sized the new block and the copy using the old size
  • and it never handled the case where the new size is smaller

This caused:

  • memory corruption
  • segmentation faults later in execution

The correct behavior:

  • if new_size <= old_size, shrink in place
  • only allocate a new block when expansion is required

Fixing this resolved the crashes.
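A simplified sketch of the fixed logic (the `old_size` parameter stands in for however the allocator records a block's usable size; the real version reads it from the block header):

```c
#include <stdlib.h>
#include <string.h>

void *my_realloc(void *ptr, size_t new_size, size_t old_size) {
    if (ptr == NULL)                     // realloc(NULL, n) behaves like malloc(n)
        return malloc(new_size);
    if (new_size == 0) {                 // realloc(p, 0): release and return NULL
        free(ptr);
        return NULL;
    }
    if (new_size <= old_size)            // shrink in place: the block already fits
        return ptr;

    void *fresh = malloc(new_size);      // grow: need a bigger block
    if (fresh == NULL)
        return NULL;                     // on failure the original block stays valid
    memcpy(fresh, ptr, old_size);        // copy only the bytes that actually exist
    free(ptr);
    return fresh;
}
```

The key line is the `new_size <= old_size` check: the buggy version skipped it and copied `old_size` bytes into a block sized for fewer, writing past the end.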


Debugging Experience

At one point, I removed locking entirely because debugging became too difficult.

The issue turned out not to be concurrency, but incorrect logic in realloc().

Using gdb helped identify the exact failure point.

One key takeaway:

Allocator bugs often don’t crash immediately.
They corrupt memory and fail later, which makes debugging harder.


Key Takeaways

  • Simple designs can perform well in specific cases, but don’t scale
  • Handling mixed allocation sizes efficiently requires better size class design
  • Thread-local caching helps, but doesn’t eliminate shared state
  • Concurrency adds complexity, especially when debugging
  • Tools like gdb are essential for low-level debugging

Next Steps

If I continue working on this allocator, I plan to:

  • improve size class handling
  • introduce per-thread arenas
  • reduce contention in shared structures

Final Thoughts

This project gave me a much clearer understanding of:

  • how allocators manage memory
  • why fragmentation and contention matter
  • why production allocators are complex

It’s one thing to read about memory allocation, and another to implement it and deal with its edge cases.

If you're interested in systems programming, building a memory allocator is a worthwhile exercise.
