<?xml version="1.0" encoding="UTF-8"?>
<rss version="2.0" xmlns:atom="http://www.w3.org/2005/Atom" xmlns:dc="http://purl.org/dc/elements/1.1/">
  <channel>
    <title>DEV Community: Prajwal zore</title>
    <description>The latest articles on DEV Community by Prajwal zore (@prajwal_zore_lm10).</description>
    <link>https://dev.to/prajwal_zore_lm10</link>
    <image>
      <url>https://media2.dev.to/dynamic/image/width=90,height=90,fit=cover,gravity=auto,format=auto/https:%2F%2Fdev-to-uploads.s3.amazonaws.com%2Fuploads%2Fuser%2Fprofile_image%2F3800069%2Fcb74e890-074c-497e-9fdc-f11560f5e45b.png</url>
      <title>DEV Community: Prajwal zore</title>
      <link>https://dev.to/prajwal_zore_lm10</link>
    </image>
    <atom:link rel="self" type="application/rss+xml" href="https://dev.to/feed/prajwal_zore_lm10"/>
    <language>en</language>
    <item>
      <title>Inside My Custom malloc: Bins, tcache, mmap, and Thread Safety</title>
      <dc:creator>Prajwal zore</dc:creator>
      <pubDate>Sun, 03 May 2026 10:52:51 +0000</pubDate>
      <link>https://dev.to/prajwal_zore_lm10/inside-my-custom-malloc-bins-tcache-mmap-and-thread-safety-566m</link>
      <guid>https://dev.to/prajwal_zore_lm10/inside-my-custom-malloc-bins-tcache-mmap-and-thread-safety-566m</guid>
      <description>&lt;p&gt;Most developers use &lt;code&gt;malloc&lt;/code&gt; without thinking much about what happens underneath.&lt;br&gt;&lt;br&gt;
This project is an attempt to explore that layer by building a memory allocator from scratch in C.&lt;/p&gt;

&lt;p&gt;The allocator implements &lt;code&gt;malloc&lt;/code&gt;, &lt;code&gt;free&lt;/code&gt;, &lt;code&gt;calloc&lt;/code&gt;, and &lt;code&gt;realloc&lt;/code&gt; without relying on libc’s heap functions. It focuses on:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Thread safety&lt;/li&gt;
&lt;li&gt;Per-thread caching (tcache)&lt;/li&gt;
&lt;li&gt;Efficient free block management using bins&lt;/li&gt;
&lt;li&gt;mmap-based memory growth&lt;/li&gt;
&lt;li&gt;Handling large allocations separately&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This article breaks down the design, implementation decisions, performance characteristics, and limitations of the allocator.&lt;/p&gt;




&lt;h2&gt;
  
  
  What is a Memory Allocator?
&lt;/h2&gt;

&lt;p&gt;A memory allocator is responsible for managing dynamic memory at runtime.&lt;br&gt;&lt;br&gt;
Functions like &lt;code&gt;malloc&lt;/code&gt;, &lt;code&gt;free&lt;/code&gt;, &lt;code&gt;calloc&lt;/code&gt;, and &lt;code&gt;realloc&lt;/code&gt; are part of this layer.&lt;/p&gt;

&lt;p&gt;At a high level, an allocator:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Requests memory from the operating system (e.g., using &lt;code&gt;mmap&lt;/code&gt;)&lt;/li&gt;
&lt;li&gt;Splits that memory into smaller blocks&lt;/li&gt;
&lt;li&gt;Tracks which blocks are free or in use&lt;/li&gt;
&lt;li&gt;Reuses freed blocks to avoid unnecessary system calls&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This layer sits between user programs and the OS, making memory allocation efficient and reusable.&lt;/p&gt;




&lt;h2&gt;
  
  
  Why Allocators Are Non-Trivial
&lt;/h2&gt;

&lt;p&gt;A good allocator must balance multiple competing goals:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Performance&lt;/strong&gt; → allocations should be fast
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Memory efficiency&lt;/strong&gt; → minimize fragmentation
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Scalability&lt;/strong&gt; → handle multi-threaded workloads
&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;System overhead&lt;/strong&gt; → reduce expensive syscalls
&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Modern allocators like those in libc (e.g., ptmalloc) are highly optimized and use techniques such as arenas, bins, and thread-local caching.&lt;/p&gt;

&lt;p&gt;This project implements a simplified version of those ideas to understand how they work in practice.&lt;/p&gt;




&lt;h2&gt;
  
  
  Allocator Overview
&lt;/h2&gt;

&lt;p&gt;At a high level, the allocator follows two distinct paths based on allocation size:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;
&lt;strong&gt;Small allocations (&amp;lt; 128KB)&lt;/strong&gt; → handled through heap, bins, and per-thread cache&lt;/li&gt;
&lt;li&gt;
&lt;strong&gt;Large allocations (≥ 128KB)&lt;/strong&gt; → handled using &lt;code&gt;mmap&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;

&lt;h3&gt;
  
  
  Allocation Flow
&lt;/h3&gt;

&lt;p&gt;&lt;a href="https://mermaid.live/edit#pako:eNpVUslu2zAQ_ZXBnBVDm6MFRYvajhs3TQ5NL63kA0ONLcEiqVIUWsfyv5eWHETliTNvGT5iTshVQZjirlZ_eMm0gR-rXII9nzNxFKyuFd_Czc1HWJza6pXgA3h-_LD4dM7lyFtc0P4ntT0ss2VJ_ACGM17SdsSXA35fmR5W2XcynZa2IxVY58N_nMeqtSZ32TcLQEmsuaJ3w_x19kxM8xJeKtlekfWgW6tOFj18yZ6bujLwMjEeCU_KwG4k3dsn_O6otQ2tBAjBGuBlJy-CaZ4n1cPmGqdmek8wzbR5z_T1LdN07GaS5yEbhrTHltvf3KKDe10VmBrdkYOCtGCXEk8XaY6mJEE5pvZaMH3IMZdnq2mY_KWUeJNp1e1LTHesbm3VNQUztKrYXrN3CsmC9NLGNpgG_nzwwPSEfzH1wmAWuXHgx34YuZEbOHjENA5nSRJ6gZ8k81t3HkZnB1-Hoe7sNvJiL5hbsh8mSRI5SEVllH4c12fYovM_HZG2xQ" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2Fpako%3AeNpVUslu2zAQ_ZXBnBVDm6MFRYvajhs3TQ5NL63kA0ONLcEiqVIUWsfyv5eWHETliTNvGT5iTshVQZjirlZ_eMm0gR-rXII9nzNxFKyuFd_Czc1HWJza6pXgA3h-_LD4dM7lyFtc0P4ntT0ss2VJ_ACGM17SdsSXA35fmR5W2XcynZa2IxVY58N_nMeqtSZ32TcLQEmsuaJ3w_x19kxM8xJeKtlekfWgW6tOFj18yZ6bujLwMjEeCU_KwG4k3dsn_O6otQ2tBAjBGuBlJy-CaZ4n1cPmGqdmek8wzbR5z_T1LdN07GaS5yEbhrTHltvf3KKDe10VmBrdkYOCtGCXEk8XaY6mJEE5pvZaMH3IMZdnq2mY_KWUeJNp1e1LTHesbm3VNQUztKrYXrN3CsmC9NLGNpgG_nzwwPSEfzH1wmAWuXHgx34YuZEbOHjENA5nSRJ6gZ8k81t3HkZnB1-Hoe7sNvJiL5hbsh8mSRI5SEVllH4c12fYovM_HZG2xQ%3Ftype%3Dpng" alt="flow chart" width="780" height="766"&gt;&lt;/a&gt;&lt;/p&gt;

&lt;ol&gt;
&lt;li&gt;
&lt;code&gt;mymalloc(size)&lt;/code&gt; is called&lt;/li&gt;
&lt;li&gt;Size is aligned to 8 bytes&lt;/li&gt;
&lt;li&gt;If size &amp;lt; 128KB:

&lt;ul&gt;
&lt;li&gt;Check per-thread tcache&lt;/li&gt;
&lt;li&gt;If hit → return immediately (no locking)&lt;/li&gt;
&lt;li&gt;If miss → acquire global heap lock

&lt;ul&gt;
&lt;li&gt;Search free bins&lt;/li&gt;
&lt;li&gt;If not found → request a new chunk from the OS via &lt;code&gt;mmap&lt;/code&gt;&lt;/li&gt;
&lt;li&gt;Split block if necessary&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;If size ≥ 128KB:

&lt;ul&gt;
&lt;li&gt;Try large block cache&lt;/li&gt;
&lt;li&gt;Otherwise call &lt;code&gt;mmap&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;
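&lt;p&gt;Steps 2–4 can be sketched in a few lines of C (the 128KB threshold is from the flow above; the helper names are hypothetical):&lt;/p&gt;

```c
#include <stdbool.h>
#include <stddef.h>

#define ALIGNMENT 8
#define MMAP_THRESHOLD (128 * 1024)  /* 128KB boundary from the flow above */

/* Step 2: round the request up to the next multiple of 8 bytes. */
static size_t align8(size_t size) {
    return (size + (ALIGNMENT - 1)) & ~(size_t)(ALIGNMENT - 1);
}

/* Steps 3/4: decide between the small path (tcache/bins) and mmap. */
static bool is_large_request(size_t size) {
    return align8(size) >= MMAP_THRESHOLD;
}
```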




&lt;h3&gt;
  
  
  Free Flow
&lt;/h3&gt;

&lt;ol&gt;
&lt;li&gt;If block belongs to heap:

&lt;ul&gt;
&lt;li&gt;Push to thread-local tcache&lt;/li&gt;
&lt;li&gt;If tcache is full → flush to global bins + coalesce&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;li&gt;If block is mmap’d:

&lt;ul&gt;
&lt;li&gt;Store in large cache or release via &lt;code&gt;munmap&lt;/code&gt;
&lt;/li&gt;
&lt;/ul&gt;
&lt;/li&gt;
&lt;/ol&gt;




&lt;h3&gt;
  
  
  Each allocation is preceded by a metadata header:
&lt;/h3&gt;

&lt;p&gt;[ block_header | user_data ]&lt;/p&gt;

&lt;p&gt;The header stores:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;size&lt;/li&gt;
&lt;li&gt;allocation state&lt;/li&gt;
&lt;li&gt;mmap flag&lt;/li&gt;
&lt;li&gt;pointers for heap and bin lists
&lt;/li&gt;
&lt;/ul&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight plaintext"&gt;&lt;code&gt;┌──────────────────────────────┐
│ block_header_t               │  48 bytes
│  size_t size                 │
│  int isfree                  │
│  int ismmapped               │
│  int in_tcache               │
│  block_header_t *next/prev   │  heap linked list
│  block_header_t *bin_next    │  bin free list
│  block_header_t *bin_prev    │  bin free list
├──────────────────────────────┤
│ user data                    │  size bytes  ← returned pointer
└──────────────────────────────┘
&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;
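&lt;p&gt;Expressed as a C declaration, the header above might look like this (field names follow the diagram; the exact padding, and hence the 48-byte figure, depends on the platform):&lt;/p&gt;

```c
#include <stddef.h>

// Metadata header preceding every allocation (see diagram above).
typedef struct block_header {
    size_t size;                                // usable bytes that follow
    int    isfree;
    int    ismmapped;
    int    in_tcache;
    struct block_header *next, *prev;           // heap linked list
    struct block_header *bin_next, *bin_prev;   // bin free list
} block_header_t;

// The pointer returned to the caller starts just past the header.
static void *user_ptr(block_header_t *b) {
    return (void *)(b + 1);
}
```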






&lt;h3&gt;
  
  
  Free blocks are organized into 8 bins based on size ranges, which allows:
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Faster lookup (O(1) class selection)&lt;/li&gt;
&lt;li&gt;Reduced search overhead
&lt;a href="https://mermaid.live/edit#pako:eNpdUltvmzAU_itHR-pbEgG5EJDWaQ1NL2unPexlhWhy4ATSGoyMWddB_vuw3SbR_GJ8znc5n3GHqcgIQ9xx8ZoWTCr4ESUVDOtLXL7tJNEGxuNLuOpuidUgJJTlsG-5SF8-H5LKYi8uwLRrpgpbudKs3hQNtodV_L1tClACapJjVUhiGaiUpQVtLGdlnKLOFiEVbaXg8hO43lJbaUhkZH9S08N1vOZG8Fzi2kis40fxm6xxox1zLraMw3ZfNe_AtQHexCvBODUpAcueWUqDoWVtzv2-iR5u40hU2uaY2FzE_4lPt9PD3TExZzKnX-eD3hn_--5Bd8AG3rWcfwS9PwX9GpdtdZTdnPf1YA92MBxhLvcZhkq2NMKSZMn0ETtNSFAVVFKC4fCZMfmSYFIdBk7Nqichyg-aFG1eYLhjvBlObZ0xRdGe5ZKdIFRlJFf652Do-a7RwLDDPxi63nziBcE08HzHcYP5dOi-YTj1J643C3zfdd2lu5g5i8MI_xpbZ7L0586wpsHCmfnefDlCyvZKyEf7MM37PPwD8cHRqg" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2Fpako%3AeNpdUltvmzAU_itHR-pbEgG5EJDWaQ1NL2unPexlhWhy4ATSGoyMWddB_vuw3SbR_GJ8znc5n3GHqcgIQ9xx8ZoWTCr4ESUVDOtLXL7tJNEGxuNLuOpuidUgJJTlsG-5SF8-H5LKYi8uwLRrpgpbudKs3hQNtodV_L1tClACapJjVUhiGaiUpQVtLGdlnKLOFiEVbaXg8hO43lJbaUhkZH9S08N1vOZG8Fzi2kis40fxm6xxox1zLraMw3ZfNe_AtQHexCvBODUpAcueWUqDoWVtzv2-iR5u40hU2uaY2FzE_4lPt9PD3TExZzKnX-eD3hn_--5Bd8AG3rWcfwS9PwX9GpdtdZTdnPf1YA92MBxhLvcZhkq2NMKSZMn0ETtNSFAVVFKC4fCZMfmSYFIdBk7Nqichyg-aFG1eYLhjvBlObZ0xRdGe5ZKdIFRlJFf652Do-a7RwLDDPxi63nziBcE08HzHcYP5dOi-YTj1J643C3zfdd2lu5g5i8MI_xpbZ7L0586wpsHCmfnefDlCyvZKyEf7MM37PPwD8cHRqg%3Ftype%3Dpng" alt="free chart" width="726" height="1052"&gt;&lt;/a&gt;
&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  Each thread maintains its own cache of free blocks.
&lt;/h3&gt;

&lt;p&gt;This means every thread now has its own temporary store of free blocks that it can reuse as needed, without touching shared state.&lt;/p&gt;

&lt;p&gt;Benefits:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;No locking on fast path&lt;/li&gt;
&lt;li&gt;High cache locality&lt;/li&gt;
&lt;li&gt;Significant performance boost in multi-threaded workloads&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;a href="https://mermaid.live/edit#pako:eNqFkjtvwjAQgP-KdWsDxHYgj6ESSgUdyoKYmnQwiSERSZzajvoA_ntNAipQVG6yfffdJ59uC4lIOQSwKsRHkjGp0cs8rpAJ1SzXktUZWuBokUnOUoTfutQhxjgaF4VIBhPJOZrz94Yrrc4KQhzphCUZRzdp1Os9mpruiVdpXF1ryUlLzkFyT0uutJd0pyX_aOlJS89Bek9Lr7SXdKelf7RhO4fdLFcKDdCkaFS2Q9NoWoglK9AzZzVa5pVCD8jIN8eWIbkFHXP0Vg4sWMs8hUDLhltQclmywxW2ByoGnfGSxxCYY8rkJoa42humZtWrEOUJk6JZZxCsWKHMralTpvlTzszcfkvM37gMRVNpCCimbQ8ItvAJAXZo37U9SjziuLZrm-QXBJ7T930HU-L7w5E9dNy9Bd-t1O6PXOxhOqQ2cR3i2CMLeJprIWfdzraru_8BrTzOAA" rel="noopener noreferrer"&gt;&lt;img src="https://media2.dev.to/dynamic/image/width=800%2Cheight=%2Cfit=scale-down%2Cgravity=auto%2Cformat=auto/https%3A%2F%2Fmermaid.ink%2Fimg%2Fpako%3AeNqFkjtvwjAQgP-KdWsDxHYgj6ESSgUdyoKYmnQwiSERSZzajvoA_ntNAipQVG6yfffdJ59uC4lIOQSwKsRHkjGp0cs8rpAJ1SzXktUZWuBokUnOUoTfutQhxjgaF4VIBhPJOZrz94Yrrc4KQhzphCUZRzdp1Os9mpruiVdpXF1ryUlLzkFyT0uutJd0pyX_aOlJS89Bek9Lr7SXdKelf7RhO4fdLFcKDdCkaFS2Q9NoWoglK9AzZzVa5pVCD8jIN8eWIbkFHXP0Vg4sWMs8hUDLhltQclmywxW2ByoGnfGSxxCYY8rkJoa42humZtWrEOUJk6JZZxCsWKHMralTpvlTzszcfkvM37gMRVNpCCimbQ8ItvAJAXZo37U9SjziuLZrm-QXBJ7T930HU-L7w5E9dNy9Bd-t1O6PXOxhOqQ2cR3i2CMLeJprIWfdzraru_8BrTzOAA%3Ftype%3Dpng" alt="tcache" width="860" height="428"&gt;&lt;/a&gt;&lt;/p&gt;




&lt;h3&gt;
  
  
  Free Block Bins
&lt;/h3&gt;

&lt;p&gt;Free blocks are organized into 8 size-based bins.&lt;/p&gt;

&lt;p&gt;This enables:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;O(1) size class lookup
&lt;/li&gt;
&lt;li&gt;Reduced search overhead
&lt;/li&gt;
&lt;li&gt;Better reuse of memory blocks
&lt;/li&gt;
&lt;/ul&gt;
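&lt;p&gt;One way to get the O(1) lookup is doubling size ranges, so the bin index is just a shift count (the project's exact ranges aren't shown, so this is an illustrative scheme):&lt;/p&gt;

```c
#include <stddef.h>

#define NUM_BINS 8

/* Illustrative doubling ranges: bin 0 holds sizes below 32 bytes,
 * bin 1 holds 32-63, bin 2 holds 64-127, ... bin 7 holds the rest. */
static int bin_index(size_t size) {
    int idx = 0;
    size >>= 5;  /* sizes below 32 land in bin 0 */
    while (size > 0 && idx < NUM_BINS - 1) {
        size >>= 1;
        ++idx;
    }
    return idx;
}
```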




&lt;h2&gt;
  
  
  Benchmarks
&lt;/h2&gt;

&lt;p&gt;Tests were run on x86-64 Linux with 8 threads and &lt;code&gt;-O2&lt;/code&gt;.&lt;/p&gt;

&lt;div class="table-wrapper-paragraph"&gt;&lt;table&gt;
&lt;thead&gt;
&lt;tr&gt;
&lt;th&gt;Test&lt;/th&gt;
&lt;th&gt;Custom&lt;/th&gt;
&lt;th&gt;libc&lt;/th&gt;
&lt;th&gt;Result&lt;/th&gt;
&lt;/tr&gt;
&lt;/thead&gt;
&lt;tbody&gt;
&lt;tr&gt;
&lt;td&gt;Single alloc/free (1M)&lt;/td&gt;
&lt;td&gt;58ms&lt;/td&gt;
&lt;td&gt;29ms&lt;/td&gt;
&lt;td&gt;2x slower&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Batch alloc(10k)&lt;/td&gt;
&lt;td&gt;1.44ms&lt;/td&gt;
&lt;td&gt;3.59ms&lt;/td&gt;
&lt;td&gt;2.5x faster&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Batch free(10k)&lt;/td&gt;
&lt;td&gt;0.36ms&lt;/td&gt;
&lt;td&gt;1.54ms&lt;/td&gt;
&lt;td&gt;4x faster&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Mixed sizes(100k)&lt;/td&gt;
&lt;td&gt;6.46ms&lt;/td&gt;
&lt;td&gt;2.95ms&lt;/td&gt;
&lt;td&gt;2x slower&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Realloc chain(100k)&lt;/td&gt;
&lt;td&gt;6.42ms&lt;/td&gt;
&lt;td&gt;2.56ms&lt;/td&gt;
&lt;td&gt;2.5x slower&lt;/td&gt;
&lt;/tr&gt;
&lt;tr&gt;
&lt;td&gt;Multithreaded (8 threads × 5k each)&lt;/td&gt;
&lt;td&gt;64ms&lt;/td&gt;
&lt;td&gt;67ms&lt;/td&gt;
&lt;td&gt;Comparable&lt;/td&gt;
&lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;&lt;/div&gt;

&lt;p&gt;Looking at the performance stats, it becomes clear why libc’s allocator is so highly optimized: it’s the result of decades of engineering and refinement.&lt;br&gt;
While my allocator produced only average results and doesn’t match libc’s performance, building it gave me a much deeper understanding of how memory allocators work and of what makes production-grade implementations fundamentally different.&lt;/p&gt;




&lt;h3&gt;
  
  
  Observations
&lt;/h3&gt;

&lt;ul&gt;
&lt;li&gt;Batch workloads benefit heavily from tcache&lt;/li&gt;
&lt;li&gt;Large allocation cache reduces mmap calls significantly&lt;/li&gt;
&lt;li&gt;Global lock limits scalability&lt;/li&gt;
&lt;li&gt;libc remains more optimized for general workloads&lt;/li&gt;
&lt;li&gt;Next, I plan to implement per-thread arenas to eliminate global lock contention.&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Limitations
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Single global lock limits scalability
&lt;/li&gt;
&lt;li&gt;No in-place realloc
&lt;/li&gt;
&lt;li&gt;No coalescing inside tcache
&lt;/li&gt;
&lt;li&gt;Large mmap blocks may waste memory
&lt;/li&gt;
&lt;li&gt;No cleanup on thread exit
&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Source Code
&lt;/h2&gt;

&lt;p&gt;&lt;a href="//github.com/Whitfrost21/bump_allocator_c"&gt;You can explore the full source here.&lt;/a&gt;&lt;/p&gt;




&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;Building a memory allocator from scratch highlights the trade-offs between performance, complexity, and correctness.&lt;/p&gt;

&lt;p&gt;Even a simplified allocator quickly grows in complexity when thread safety, caching, and fragmentation are considered.&lt;/p&gt;

&lt;p&gt;If you have suggestions, optimizations, or questions, feel free to ask or start a discussion.&lt;/p&gt;

</description>
      <category>c</category>
      <category>linux</category>
      <category>architecture</category>
      <category>performance</category>
    </item>
    <item>
      <title>I Built malloc() from Scratch in C — Here’s What Went Wrong</title>
      <dc:creator>Prajwal zore</dc:creator>
      <pubDate>Sun, 26 Apr 2026 17:32:19 +0000</pubDate>
      <link>https://dev.to/prajwal_zore_lm10/i-built-malloc-from-scratch-in-c-heres-what-went-wrong-5f60</link>
      <guid>https://dev.to/prajwal_zore_lm10/i-built-malloc-from-scratch-in-c-heres-what-went-wrong-5f60</guid>
      <description>&lt;p&gt;Most of us use &lt;code&gt;malloc()&lt;/code&gt; without thinking about what happens underneath.&lt;/p&gt;

&lt;p&gt;I decided to implement my own memory allocator in C to understand it better. This wasn’t for production use, just to learn how allocation, fragmentation, and concurrency actually behave in practice.&lt;/p&gt;

&lt;p&gt;I also benchmarked it against glibc’s &lt;code&gt;malloc&lt;/code&gt; to see where it stands.&lt;/p&gt;




&lt;h2&gt;
  
  
  Implementation Overview
&lt;/h2&gt;

&lt;p&gt;&lt;a href="https://dev.to/prajwal_zore_lm10/inside-my-custom-malloc-bins-tcache-mmap-and-thread-safety-566m"&gt;(checkout entire implementation here)&lt;/a&gt;&lt;br&gt;
My allocator currently includes:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Thread-local cache&lt;/li&gt;
&lt;li&gt;Free lists (bins) for different size ranges&lt;/li&gt;
&lt;li&gt;Direct &lt;code&gt;mmap&lt;/code&gt; for larger allocations&lt;/li&gt;
&lt;li&gt;A custom &lt;code&gt;realloc()&lt;/code&gt; implementation&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Benchmark Results
&lt;/h2&gt;

&lt;h3&gt;
  
  
  glibc malloc
&lt;/h3&gt;

&lt;p&gt;Single-threaded:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;alloc + free (1M iterations): ~26 ms&lt;/li&gt;
&lt;li&gt;batch alloc/free: 3.40 ms / 0.95 ms&lt;/li&gt;
&lt;li&gt;mixed sizes: ~2.5 ms&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Multi-threaded:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;8 threads: ~57 ms&lt;/li&gt;
&lt;/ul&gt;




&lt;h3&gt;
  
  
  My Allocator
&lt;/h3&gt;

&lt;p&gt;Single-threaded:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;alloc + free (1M iterations): ~83 ms&lt;/li&gt;
&lt;li&gt;batch alloc/free: 1.46 ms / 0.50 ms (faster than glibc)&lt;/li&gt;
&lt;li&gt;mixed sizes: ~126 ms&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Multi-threaded:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;8 threads: ~791 ms&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  What Worked
&lt;/h2&gt;

&lt;p&gt;Batch allocation and free operations were faster than glibc.&lt;/p&gt;

&lt;p&gt;This likely comes from:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;simpler logic in the fast path&lt;/li&gt;
&lt;li&gt;low per-operation overhead&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;So in very controlled scenarios, a simple allocator can outperform a general-purpose one.&lt;/p&gt;




&lt;h2&gt;
  
  
  Where It Struggled
&lt;/h2&gt;

&lt;h3&gt;
  
  
  Mixed Allocation Sizes
&lt;/h3&gt;

&lt;p&gt;Performance dropped heavily when handling mixed sizes.&lt;/p&gt;

&lt;p&gt;The main issue was my bin design:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;limited number of bins&lt;/li&gt;
&lt;li&gt;coarse grouping of sizes&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This leads to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;poor fit for requested sizes&lt;/li&gt;
&lt;li&gt;more fragmentation&lt;/li&gt;
&lt;li&gt;additional overhead during allocation&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;glibc avoids this with more refined size classes.&lt;/p&gt;




&lt;h3&gt;
  
  
  Multithreading
&lt;/h3&gt;

&lt;p&gt;This was the biggest weakness.&lt;/p&gt;

&lt;p&gt;Even with thread-local caches, I ran into issues:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;shared access to heap structures&lt;/li&gt;
&lt;li&gt;contention when falling back to global data&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;I tried:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;global locks&lt;/li&gt;
&lt;li&gt;per-bin locks&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Both increased complexity, and debugging became harder.&lt;/p&gt;




&lt;h3&gt;
  
  
  &lt;code&gt;realloc()&lt;/code&gt; Bug
&lt;/h3&gt;

&lt;p&gt;The most difficult issue I faced was in &lt;code&gt;realloc()&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;I initially made a mistake:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;allocating a new block using the old size&lt;/li&gt;
&lt;li&gt;instead of handling cases where the new size is smaller&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;This caused:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;memory corruption&lt;/li&gt;
&lt;li&gt;segmentation faults later in execution&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;The correct behavior:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;if &lt;code&gt;new_size &amp;lt;= old_size&lt;/code&gt;, shrink in place&lt;/li&gt;
&lt;li&gt;only allocate a new block when expansion is required&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Fixing this resolved the crashes.&lt;/p&gt;
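&lt;p&gt;A sketch of the corrected logic, using libc's &lt;code&gt;malloc&lt;/code&gt;/&lt;code&gt;free&lt;/code&gt; as stand-ins for the project's own functions (in the real allocator the old size comes from the block header):&lt;/p&gt;

```c
#include <stdlib.h>
#include <string.h>

void *my_realloc_fixed(void *ptr, size_t old_size, size_t new_size) {
    if (ptr == NULL) return malloc(new_size);
    if (new_size == 0) { free(ptr); return NULL; }
    if (new_size <= old_size)
        return ptr;                  /* shrink in place (a real allocator
                                        may also split the block) */
    void *fresh = malloc(new_size);  /* expand: allocate a new block, */
    if (fresh == NULL) return NULL;
    memcpy(fresh, ptr, old_size);    /* copy the old contents,        */
    free(ptr);                       /* then release the old block    */
    return fresh;
}
```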




&lt;h2&gt;
  
  
  Debugging Experience
&lt;/h2&gt;

&lt;p&gt;At one point, I removed locking entirely because debugging became too difficult.&lt;/p&gt;

&lt;p&gt;The issue turned out not to be concurrency, but incorrect logic in &lt;code&gt;realloc()&lt;/code&gt;.&lt;/p&gt;

&lt;p&gt;Using &lt;code&gt;gdb&lt;/code&gt; helped identify the exact failure point.&lt;/p&gt;

&lt;p&gt;One key takeaway:&lt;/p&gt;

&lt;p&gt;Allocator bugs often don’t crash immediately.&lt;br&gt;
They corrupt memory and fail later, which makes debugging harder.&lt;/p&gt;




&lt;h2&gt;
  
  
  Key Takeaways
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;Simple designs can perform well in specific cases, but don’t scale&lt;/li&gt;
&lt;li&gt;Handling mixed allocation sizes efficiently requires better size class design&lt;/li&gt;
&lt;li&gt;Thread-local caching helps, but doesn’t eliminate shared state&lt;/li&gt;
&lt;li&gt;Concurrency adds complexity, especially when debugging&lt;/li&gt;
&lt;li&gt;Tools like &lt;code&gt;gdb&lt;/code&gt; are essential for low-level debugging&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Next Steps
&lt;/h2&gt;

&lt;p&gt;If I continue working on this allocator, I plan to:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;improve size class handling&lt;/li&gt;
&lt;li&gt;introduce per-thread arenas&lt;/li&gt;
&lt;li&gt;reduce contention in shared structures&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  Final Thoughts
&lt;/h2&gt;

&lt;p&gt;This project gave me a much clearer understanding of:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;how allocators manage memory&lt;/li&gt;
&lt;li&gt;why fragmentation and contention matter&lt;/li&gt;
&lt;li&gt;why production allocators are complex&lt;/li&gt;
&lt;li&gt;why thread safety is important &lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;It’s one thing to read about memory allocation, and another to implement it and deal with its edge cases.&lt;/p&gt;

&lt;p&gt;If you're interested in systems programming, building a memory allocator is a worthwhile exercise.&lt;/p&gt;

</description>
      <category>c</category>
      <category>systems</category>
      <category>linux</category>
      <category>performance</category>
    </item>
    <item>
      <title>I Built a Simple Log Aggregation and Analytics tool</title>
      <dc:creator>Prajwal zore</dc:creator>
      <pubDate>Wed, 04 Mar 2026 08:40:43 +0000</pubDate>
      <link>https://dev.to/prajwal_zore_lm10/i-built-a-simple-log-aggregation-and-analytics-tool-2idp</link>
      <guid>https://dev.to/prajwal_zore_lm10/i-built-a-simple-log-aggregation-and-analytics-tool-2idp</guid>
      <description>&lt;h2&gt;
  
  
  I Built StackLens — A Simple Log Aggregation Dashboard
&lt;/h2&gt;

&lt;p&gt;Logs are one of the first things developers check when something goes wrong. But when logs come from multiple services, they quickly become hard to manage.&lt;/p&gt;

&lt;p&gt;To explore how centralized logging works, I built &lt;strong&gt;StackLens&lt;/strong&gt; — a full-stack log aggregation dashboard.&lt;/p&gt;

&lt;p&gt;The idea is simple: services send logs to a backend API, the logs are stored in PostgreSQL, and a web dashboard lets you search, filter, and analyze them.&lt;/p&gt;




&lt;h2&gt;
  
  
  Tech Stack
&lt;/h2&gt;

&lt;ul&gt;
&lt;li&gt;I started with the PERN stack. I don’t know much about other technologies yet and am still exploring, so I’m curious: which stack do you think is best suited for this project if we think about scaling it? I’d love to hear your opinion.&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Frontend&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;React&lt;/li&gt;
&lt;li&gt;TypeScript&lt;/li&gt;
&lt;li&gt;TailwindCSS&lt;/li&gt;
&lt;li&gt;TanStack Query&lt;/li&gt;
&lt;li&gt;React Router&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;&lt;strong&gt;Backend&lt;/strong&gt;&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;Node.js&lt;/li&gt;
&lt;li&gt;Express&lt;/li&gt;
&lt;li&gt;PostgreSQL&lt;/li&gt;
&lt;li&gt;Supabase (hosted Postgres)&lt;/li&gt;
&lt;/ul&gt;




&lt;h2&gt;
  
  
  What schema did I use?
&lt;/h2&gt;

&lt;p&gt;Each log entry looks like this:&lt;br&gt;
&lt;/p&gt;

&lt;div class="highlight js-code-highlight"&gt;
&lt;pre class="highlight json"&gt;&lt;code&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"service"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"auth-service"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"level"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"info"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"message"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"User authenticated successfully"&lt;/span&gt;&lt;span class="p"&gt;,&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"metadata"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;{&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="nl"&gt;"ip"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"192.168.1.10"&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="p"&gt;},&lt;/span&gt;&lt;span class="w"&gt;
  &lt;/span&gt;&lt;span class="nl"&gt;"timestamp"&lt;/span&gt;&lt;span class="p"&gt;:&lt;/span&gt;&lt;span class="w"&gt; &lt;/span&gt;&lt;span class="s2"&gt;"2026-03-02T11:39:16Z"&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;span class="p"&gt;}&lt;/span&gt;&lt;span class="w"&gt;
&lt;/span&gt;&lt;/code&gt;&lt;/pre&gt;

&lt;/div&gt;






&lt;h2&gt;
  
  
  Why I Built This
&lt;/h2&gt;

&lt;p&gt;The goal of this project was to practice building a real-world style dashboard that combines:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;backend APIs&lt;/li&gt;
&lt;li&gt;database design&lt;/li&gt;
&lt;li&gt;frontend data fetching&lt;/li&gt;
&lt;li&gt;responsive UI design&lt;/li&gt;
&lt;/ul&gt;

&lt;p&gt;Even though it's a simplified system, it helped me understand how log monitoring tools work behind the scenes.&lt;/p&gt;




&lt;h2&gt;
  
  
  What I Learned
&lt;/h2&gt;

&lt;p&gt;Building StackLens helped me practice:&lt;/p&gt;

&lt;ul&gt;
&lt;li&gt;designing filtering APIs&lt;/li&gt;
&lt;li&gt;using PostgreSQL enums and JSONB&lt;/li&gt;
&lt;li&gt;managing server state with TanStack Query&lt;/li&gt;
&lt;li&gt;building responsive dashboard layouts&lt;/li&gt;
&lt;/ul&gt;




&lt;p&gt;If you're interested, you can check out the project here:&lt;/p&gt;

&lt;p&gt;&lt;strong&gt;GitHub:&lt;/strong&gt; (&lt;a href="https://github.com/Whitfrost21/StackLens" rel="noopener noreferrer"&gt;https://github.com/Whitfrost21/StackLens&lt;/a&gt;)&lt;/p&gt;




&lt;p&gt;Thanks for reading! I’d love to hear from you if you build something similar.&lt;/p&gt;

</description>
      <category>beginners</category>
      <category>monitoring</category>
      <category>postgres</category>
      <category>showdev</category>
    </item>
    <item>
      <title>Is ChatGPT heavier now, or is that a bug?</title>
      <dc:creator>Prajwal zore</dc:creator>
      <pubDate>Sun, 01 Mar 2026 15:28:13 +0000</pubDate>
      <link>https://dev.to/prajwal_zore_lm10/chatgpt-is-heavier-or-what--47mm</link>
      <guid>https://dev.to/prajwal_zore_lm10/chatgpt-is-heavier-or-what--47mm</guid>
      <description>&lt;p&gt;hello devs,&lt;br&gt;
i was recently using chatgpt for a while and i got a surprise that when i try to open and start chats old chats which have a lot of conversations in it of course and suddenly my entire RAM was on the fire that my laptop started to freeze down. may be this is because chatgpt's DOM ,but they must optimize.&lt;br&gt;
Also switching chats is good option but it still misses topic and goes out of context, better if they provide a summary option which help keeping track of context. What do you think ? &lt;/p&gt;

</description>
      <category>chatgpt</category>
      <category>webdev</category>
      <category>discuss</category>
      <category>todayilearned</category>
    </item>
  </channel>
</rss>
