Prajwal zore
I Built malloc() from Scratch in C — Here’s What Went Wrong

Most of us use malloc() without thinking about what happens underneath.

I decided to implement my own memory allocator in C to understand it better. This wasn’t for production use, just to learn how allocation, fragmentation, and concurrency actually behave in practice.

I also benchmarked it against glibc’s malloc to see where it stands.


Implementation Overview

My allocator currently includes:

  • Thread-local cache
  • Free lists (bins) for different size ranges
  • Direct mmap for larger allocations
  • A custom realloc() implementation
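To make those pieces concrete, here is a hypothetical sketch of the data structures involved (the names, bin count, and threshold are my illustration here, not the actual implementation):

```c
#include <stddef.h>

// Illustrative layout: each block carries a small header, small blocks go
// into segregated free lists (bins), and anything above MMAP_THRESHOLD is
// served directly by mmap(). 128 KiB mirrors glibc's default threshold.
#define NUM_BINS       8
#define MMAP_THRESHOLD (128 * 1024)

typedef struct block_header {
    size_t size;                  // usable size of the block
    int    from_mmap;             // 1 if mmap-backed, 0 if from a bin
    struct block_header *next;    // next free block in the same bin
} block_header_t;

// Per-thread cache: one free list per size class; no locking is needed
// as long as a thread only touches its own lists.
typedef struct {
    block_header_t *bins[NUM_BINS];
} thread_cache_t;

static _Thread_local thread_cache_t tcache;

// Map a request size to a bin index (coarse power-of-two classes,
// starting at 32 bytes).
static int bin_index(size_t size) {
    int i = 0;
    size_t class_size = 32;
    while (class_size < size && i < NUM_BINS - 1) {
        class_size <<= 1;
        i++;
    }
    return i;
}
```

The coarse `bin_index` mapping here is exactly the kind of design that causes the mixed-size problems described later.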

Benchmark Results

glibc malloc

Single-threaded:

  • alloc + free (1M iterations): ~26 ms
  • batch alloc/free: 3.40 ms / 0.95 ms
  • mixed sizes: ~2.5 ms

Multi-threaded:

  • 8 threads: ~57 ms

My Allocator

Single-threaded:

  • alloc + free (1M iterations): ~83 ms
  • batch alloc/free: 1.46 ms / 0.50 ms (faster than glibc)
  • mixed sizes: ~126 ms

Multi-threaded:

  • 8 threads: ~791 ms

What Worked

Batch allocation and free operations were faster than glibc's.

This likely comes from:

  • simpler logic in the fast path
  • low per-operation overhead

So in very controlled scenarios, a simple allocator can outperform a general-purpose one.


Where It Struggled

Mixed Allocation Sizes

Performance dropped heavily when handling mixed sizes.

The main issue was my bin design:

  • limited number of bins
  • coarse grouping of sizes

This led to:

  • poor fit for requested sizes
  • more fragmentation
  • additional overhead during allocation

glibc avoids this with finer-grained size classes.
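The difference is easy to quantify. Assuming a coarse power-of-two scheme like mine versus evenly spaced classes (16-byte spacing, which is the spirit of glibc's small bins, simplified here):

```c
#include <stddef.h>

// A coarse power-of-two scheme wastes a lot inside each class: a 130-byte
// request lands in a 256-byte class, wasting 126 bytes. Finer, evenly
// spaced classes cap the internal waste at (spacing - 1) bytes.
static size_t coarse_class(size_t size) {       // power-of-two rounding
    size_t c = 32;
    while (c < size) c <<= 1;
    return c;
}

static size_t fine_class(size_t size) {         // round up to 16-byte spacing
    return (size + 15) & ~(size_t)15;
}
```

For a 130-byte request, `coarse_class` returns 256 (126 bytes of internal waste, nearly 50%) while `fine_class` returns 144 (14 bytes of waste). That gap, multiplied across a mixed-size workload, is where much of my slowdown and fragmentation came from.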


Multithreading

This was the biggest weakness.

Even with thread-local caches, I ran into issues:

  • shared access to heap structures
  • contention when falling back to global data

I tried:

  • global locks
  • per-bin locks

Both added complexity and made debugging harder.
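For reference, the per-bin variant looked roughly like this (names and structure are my sketch, not the real code). One mutex per bin narrows the critical section compared to a single global lock, but every fall-through to the shared bins still serializes somewhere:

```c
#include <pthread.h>
#include <stddef.h>

#define NUM_BINS 8

typedef struct free_node { struct free_node *next; } free_node_t;

static free_node_t    *global_bins[NUM_BINS];
static pthread_mutex_t bin_locks[NUM_BINS] = {
    [0 ... NUM_BINS - 1] = PTHREAD_MUTEX_INITIALIZER   // range init: GCC/Clang extension
};

// Push a freed block onto the shared bin for its size class.
static void global_bin_push(int bin, free_node_t *node) {
    pthread_mutex_lock(&bin_locks[bin]);
    node->next = global_bins[bin];
    global_bins[bin] = node;
    pthread_mutex_unlock(&bin_locks[bin]);
}

// Pop a block, or NULL if the bin is empty (caller then falls back
// to requesting fresh memory from the OS).
static free_node_t *global_bin_pop(int bin) {
    pthread_mutex_lock(&bin_locks[bin]);
    free_node_t *node = global_bins[bin];
    if (node) global_bins[bin] = node->next;
    pthread_mutex_unlock(&bin_locks[bin]);
    return node;
}
```

Even with this split, threads allocating in the same size class contend on the same lock, which is consistent with the 8-thread numbers above.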


realloc() Bug

The most difficult issue I faced was in realloc().

My initial version made a mistake:

  • it sized the new block and the copy using the old size
  • and it never handled the case where the new size is smaller

This caused:

  • memory corruption
  • segmentation faults later in execution

The correct behavior:

  • if new_size <= old_size, shrink in place
  • only allocate a new block when expansion is required

Fixing this resolved the crashes.
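A simplified sketch of the fixed logic (the `old_size` parameter stands in for however the allocator records a block's usable size; the real version reads it from the block header):

```c
#include <stdlib.h>
#include <string.h>

void *my_realloc(void *ptr, size_t new_size, size_t old_size) {
    if (ptr == NULL)                     // realloc(NULL, n) behaves like malloc(n)
        return malloc(new_size);
    if (new_size == 0) {                 // realloc(p, 0): release and return NULL
        free(ptr);
        return NULL;
    }
    if (new_size <= old_size)            // shrink in place: the block already fits
        return ptr;

    void *fresh = malloc(new_size);      // grow: need a bigger block
    if (fresh == NULL)
        return NULL;                     // on failure the original block stays valid
    memcpy(fresh, ptr, old_size);        // copy only the bytes that actually exist
    free(ptr);
    return fresh;
}
```

The key line is the `new_size <= old_size` check: the buggy version skipped it and copied `old_size` bytes into a block sized for fewer, writing past the end.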


Debugging Experience

At one point, I removed locking entirely because debugging became too difficult.

The issue turned out not to be concurrency, but incorrect logic in realloc().

Using gdb helped identify the exact failure point.

One key takeaway:

Allocator bugs often don’t crash immediately.
They corrupt memory and fail later, which makes debugging harder.


Key Takeaways

  • Simple designs can perform well in specific cases, but don’t scale
  • Handling mixed allocation sizes efficiently requires better size class design
  • Thread-local caching helps, but doesn’t eliminate shared state
  • Concurrency adds complexity, especially when debugging
  • Tools like gdb are essential for low-level debugging

Next Steps

If I continue working on this allocator, I plan to:

  • improve size class handling
  • introduce per-thread arenas
  • reduce contention in shared structures

Final Thoughts

This project gave me a much clearer understanding of:

  • how allocators manage memory
  • why fragmentation and contention matter
  • why production allocators are complex

It’s one thing to read about memory allocation, and another to implement it and deal with its edge cases.

If you're interested in systems programming, building a memory allocator is a worthwhile exercise.
