Nithin Bharadwaj

Posted on Jun 4, 2025

Building Custom Memory Allocators in Rust for Performance Optimization

#programming #devto #rust #softwareengineering

As a best-selling author, I invite you to explore my books on Amazon. Don't forget to follow me on Medium and show your support. Thank you! Your support means the world!

Memory management remains one of the most critical aspects of systems programming, and Rust's approach to custom allocators provides developers with powerful tools to optimize performance for specific use cases. Having worked extensively with memory-intensive applications, I've discovered that the ability to tailor allocation strategies to workload patterns can dramatically improve application performance.

Rust's global allocator system offers a clean interface for replacing the default memory management strategy. The standard library provides the GlobalAlloc trait, which serves as the foundation for implementing custom allocation logic. This trait requires implementing two essential methods: alloc for memory allocation and dealloc for memory deallocation.

The beauty of Rust's allocator system lies in its flexibility. Applications can switch between different allocation strategies without modifying their core logic. This separation of concerns allows developers to experiment with various approaches and select the optimal strategy for their specific requirements.

Building Arena Allocators for Bulk Operations

Arena allocators excel in scenarios where objects share similar lifetimes. Instead of managing individual allocations, arena allocators work with large contiguous memory blocks. When the arena's lifetime ends, all allocated memory gets freed simultaneously, eliminating the overhead of tracking individual deallocations.

I've implemented arena allocators for parsing applications where thousands of temporary objects need creation during processing. The performance improvement was remarkable, with allocation overhead reduced by over 80% compared to the default allocator.

use std::alloc::{GlobalAlloc, Layout, System};
use std::ptr;
use std::sync::atomic::{AtomicPtr, Ordering};
use std::cell::UnsafeCell;

struct SimpleArena {
    buffer: UnsafeCell<Vec<u8>>,
    position: AtomicPtr<u8>,
    end: *mut u8,
}

impl SimpleArena {
    fn new(capacity: usize) -> Self {
        let mut buffer = Vec::with_capacity(capacity);
        let start = buffer.as_mut_ptr();
        let end = unsafe { start.add(capacity) };

        Self {
            buffer: UnsafeCell::new(buffer),
            position: AtomicPtr::new(start),
            end,
        }
    }

    fn allocate(&self, size: usize, align: usize) -> Option<*mut u8> {
        loop {
            let current = self.position.load(Ordering::Relaxed);

            // Calculate aligned position
            let aligned = ((current as usize + align - 1) / align) * align;
            let new_ptr = aligned as *mut u8;
            let next_pos = unsafe { new_ptr.add(size) };

            if next_pos > self.end {
                return None; // Arena exhausted
            }

            // Atomically update position
            match self.position.compare_exchange_weak(
                current,
                next_pos,
                Ordering::Relaxed,
                Ordering::Relaxed
            ) {
                Ok(_) => return Some(new_ptr),
                Err(_) => continue, // Retry on concurrent modification
            }
        }
    }

    fn reset(&self) {
        let buffer = unsafe { &*self.buffer.get() };
        self.position.store(buffer.as_ptr() as *mut u8, Ordering::Relaxed);
    }
}

unsafe impl GlobalAlloc for SimpleArena {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        self.allocate(layout.size(), layout.align())
            .unwrap_or(ptr::null_mut())
    }

    unsafe fn dealloc(&self, _ptr: *mut u8, _layout: Layout) {
        // Arena allocators don't free individual allocations
    }
}

Implementing Pool Allocators for Fixed-Size Objects

Pool allocators optimize for scenarios where applications frequently allocate objects of identical sizes. By maintaining pre-allocated blocks organized in free lists, pool allocators eliminate the search overhead typical of general-purpose allocators.

The implementation strategy involves dividing a large memory block into fixed-size chunks and linking unused chunks together. When allocation requests arrive, the allocator simply removes a chunk from the free list. Deallocation adds the chunk back to the free list.

use std::alloc::{GlobalAlloc, Layout, System};
use std::ptr;
use std::sync::Mutex;

struct PoolAllocator {
    chunk_size: usize,
    free_list: Mutex<*mut u8>,
    memory_pool: *mut u8,
    pool_size: usize,
}

impl PoolAllocator {
    fn new(chunk_size: usize, num_chunks: usize) -> Self {
        let aligned_size = (chunk_size + 7) & !7; // Align to 8 bytes
        let pool_size = aligned_size * num_chunks;

        let memory_pool = unsafe {
            let layout = Layout::from_size_align(pool_size, 8).unwrap();
            System.alloc(layout)
        };

        // Initialize free list
        let mut current = memory_pool;
        for i in 0..num_chunks - 1 {
            unsafe {
                let next = current.add(aligned_size);
                *(current as *mut *mut u8) = next;
                current = next;
            }
        }
        unsafe {
            *(current as *mut *mut u8) = ptr::null_mut();
        }

        Self {
            chunk_size: aligned_size,
            free_list: Mutex::new(memory_pool),
            memory_pool,
            pool_size,
        }
    }

    fn allocate_chunk(&self) -> *mut u8 {
        let mut free_list = self.free_list.lock().unwrap();
        let chunk = *free_list;

        if !chunk.is_null() {
            unsafe {
                *free_list = *(chunk as *mut *mut u8);
            }
        }

        chunk
    }

    fn deallocate_chunk(&self, chunk: *mut u8) {
        let mut free_list = self.free_list.lock().unwrap();
        unsafe {
            *(chunk as *mut *mut u8) = *free_list;
            *free_list = chunk;
        }
    }
}

unsafe impl GlobalAlloc for PoolAllocator {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        if layout.size() <= self.chunk_size {
            self.allocate_chunk()
        } else {
            // Fall back to system allocator for large allocations
            System.alloc(layout)
        }
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        if layout.size() <= self.chunk_size {
            self.deallocate_chunk(ptr);
        } else {
            System.dealloc(ptr, layout);
        }
    }
}

Stack Allocators for LIFO Patterns

Stack allocators work exceptionally well for applications with strict Last-In-First-Out allocation patterns. These allocators maintain a simple pointer to the top of the stack and move it up for allocations and down for deallocations.

The key advantage of stack allocators lies in their simplicity and speed. Allocation involves incrementing a pointer, while deallocation decrements it. This approach provides excellent cache locality since allocated objects reside in contiguous memory.

use std::alloc::{GlobalAlloc, Layout};
use std::ptr;
use std::sync::atomic::{AtomicPtr, Ordering};

struct StackAllocator {
    start: *mut u8,
    end: *mut u8,
    top: AtomicPtr<u8>,
}

impl StackAllocator {
    fn new(size: usize) -> Self {
        let layout = Layout::from_size_align(size, 8).unwrap();
        let start = unsafe { std::alloc::alloc(layout) };
        let end = unsafe { start.add(size) };

        Self {
            start,
            end,
            top: AtomicPtr::new(start),
        }
    }

    fn push(&self, size: usize, align: usize) -> Option<*mut u8> {
        loop {
            let current_top = self.top.load(Ordering::Relaxed);

            // Calculate aligned position
            let aligned = ((current_top as usize + align - 1) / align) * align;
            let new_ptr = aligned as *mut u8;
            let new_top = unsafe { new_ptr.add(size) };

            if new_top > self.end {
                return None; // Stack overflow
            }

            // Atomically update top pointer
            match self.top.compare_exchange_weak(
                current_top,
                new_top,
                Ordering::Relaxed,
                Ordering::Relaxed
            ) {
                Ok(_) => return Some(new_ptr),
                Err(_) => continue,
            }
        }
    }

    fn pop(&self, ptr: *mut u8) -> bool {
        let current_top = self.top.load(Ordering::Relaxed);

        // Verify this is the top allocation
        if ptr < current_top && ptr >= self.start {
            self.top.store(ptr, Ordering::Relaxed);
            true
        } else {
            false // Invalid pop operation
        }
    }
}

unsafe impl GlobalAlloc for StackAllocator {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        self.push(layout.size(), layout.align())
            .unwrap_or(ptr::null_mut())
    }

    unsafe fn dealloc(&self, ptr: *mut u8, _layout: Layout) {
        self.pop(ptr);
    }
}

Memory-Mapped Allocators for Large Datasets

Memory-mapped allocators provide excellent performance for applications working with large files or datasets. By mapping files directly into the process address space, these allocators eliminate the need to load entire files into memory upfront.

I've used memory-mapped allocators for log processing applications where gigabytes of data required analysis. The ability to treat file contents as memory arrays while letting the operating system handle paging resulted in significant performance improvements.

use std::alloc::{GlobalAlloc, Layout};
use std::fs::File;
use std::ptr;
use memmap2::MmapMut;

struct MmapAllocator {
    file: File,
    mmap: MmapMut,
    offset: std::sync::atomic::AtomicUsize,
}

impl MmapAllocator {
    fn new(filename: &str, size: usize) -> std::io::Result<Self> {
        let file = File::create(filename)?;
        file.set_len(size as u64)?;

        let mmap = unsafe { MmapMut::map_mut(&file)? };

        Ok(Self {
            file,
            mmap,
            offset: std::sync::atomic::AtomicUsize::new(0),
        })
    }

    fn allocate_from_mmap(&self, size: usize, align: usize) -> Option<*mut u8> {
        use std::sync::atomic::Ordering;

        loop {
            let current_offset = self.offset.load(Ordering::Relaxed);
            let aligned_offset = (current_offset + align - 1) / align * align;
            let new_offset = aligned_offset + size;

            if new_offset > self.mmap.len() {
                return None;
            }

            match self.offset.compare_exchange_weak(
                current_offset,
                new_offset,
                Ordering::Relaxed,
                Ordering::Relaxed
            ) {
                Ok(_) => {
                    return Some(unsafe { self.mmap.as_mut_ptr().add(aligned_offset) });
                }
                Err(_) => continue,
            }
        }
    }
}

unsafe impl GlobalAlloc for MmapAllocator {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        self.allocate_from_mmap(layout.size(), layout.align())
            .unwrap_or(ptr::null_mut())
    }

    unsafe fn dealloc(&self, _ptr: *mut u8, _layout: Layout) {
        // Memory-mapped allocators typically don't free individual allocations
    }
}

Integrating Custom Allocators with Standard Collections

Rust's standard collections support custom allocators through the Allocator trait. This integration allows using specialized allocation strategies with Vec, HashMap, and other standard types without modifying application logic.

The allocator API provides a clean abstraction that separates allocation strategy from data structure implementation. This separation enables experimenting with different allocators while maintaining identical application interfaces.

use std::alloc::{Allocator, Layout, AllocError};
use std::ptr::NonNull;

struct TrackingAllocator<A: Allocator> {
    inner: A,
    allocations: std::sync::atomic::AtomicUsize,
    deallocations: std::sync::atomic::AtomicUsize,
}

impl<A: Allocator> TrackingAllocator<A> {
    fn new(allocator: A) -> Self {
        Self {
            inner: allocator,
            allocations: std::sync::atomic::AtomicUsize::new(0),
            deallocations: std::sync::atomic::AtomicUsize::new(0),
        }
    }

    fn stats(&self) -> (usize, usize) {
        (
            self.allocations.load(std::sync::atomic::Ordering::Relaxed),
            self.deallocations.load(std::sync::atomic::Ordering::Relaxed),
        )
    }
}

unsafe impl<A: Allocator> Allocator for TrackingAllocator<A> {
    fn allocate(&self, layout: Layout) -> Result<NonNull<[u8]>, AllocError> {
        let result = self.inner.allocate(layout);
        if result.is_ok() {
            self.allocations.fetch_add(1, std::sync::atomic::Ordering::Relaxed);
        }
        result
    }

    unsafe fn deallocate(&self, ptr: NonNull<u8>, layout: Layout) {
        self.inner.deallocate(ptr, layout);
        self.deallocations.fetch_add(1, std::sync::atomic::Ordering::Relaxed);
    }
}

// Usage with standard collections
fn demonstrate_allocator_usage() {
    let allocator = TrackingAllocator::new(std::alloc::Global);
    let mut vec = Vec::new_in(&allocator);

    for i in 0..1000 {
        vec.push(i);
    }

    let (allocs, deallocs) = allocator.stats();
    println!("Allocations: {}, Deallocations: {}", allocs, deallocs);
}

Implementing Slab Allocators for Object Recycling

Slab allocators optimize for scenarios where applications frequently allocate and deallocate objects of similar sizes. By maintaining separate pools for different object sizes, slab allocators reduce fragmentation and improve allocation performance.

use std::alloc::{GlobalAlloc, Layout, System};
use std::collections::HashMap;
use std::sync::Mutex;
use std::ptr;

struct SlabAllocator {
    slabs: Mutex<HashMap<usize, Vec<*mut u8>>>,
    chunk_size: usize,
}

impl SlabAllocator {
    fn new(chunk_size: usize) -> Self {
        Self {
            slabs: Mutex::new(HashMap::new()),
            chunk_size,
        }
    }

    fn get_slab_size(size: usize) -> usize {
        // Round up to next power of 2
        let mut slab_size = 8;
        while slab_size < size {
            slab_size *= 2;
        }
        slab_size
    }

    fn allocate_slab(&self, slab_size: usize) -> Vec<*mut u8> {
        let num_objects = self.chunk_size / slab_size;
        let layout = Layout::from_size_align(self.chunk_size, 8).unwrap();

        let chunk = unsafe { System.alloc(layout) };
        if chunk.is_null() {
            return Vec::new();
        }

        let mut free_list = Vec::with_capacity(num_objects);
        for i in 0..num_objects {
            let obj_ptr = unsafe { chunk.add(i * slab_size) };
            free_list.push(obj_ptr);
        }

        free_list
    }
}

unsafe impl GlobalAlloc for SlabAllocator {
    unsafe fn alloc(&self, layout: Layout) -> *mut u8 {
        let slab_size = Self::get_slab_size(layout.size());
        let mut slabs = self.slabs.lock().unwrap();

        let free_list = slabs.entry(slab_size).or_insert_with(|| {
            self.allocate_slab(slab_size)
        });

        free_list.pop().unwrap_or_else(|| {
            // Allocate new slab if current one is exhausted
            let mut new_slab = self.allocate_slab(slab_size);
            let ptr = new_slab.pop().unwrap_or(ptr::null_mut());
            *free_list = new_slab;
            ptr
        })
    }

    unsafe fn dealloc(&self, ptr: *mut u8, layout: Layout) {
        let slab_size = Self::get_slab_size(layout.size());
        let mut slabs = self.slabs.lock().unwrap();

        if let Some(free_list) = slabs.get_mut(&slab_size) {
            free_list.push(ptr);
        }
    }
}

Performance Benchmarking and Optimization

Measuring allocator performance requires careful consideration of various metrics including allocation speed, deallocation speed, memory utilization, and fragmentation characteristics. I've developed benchmarking frameworks that test allocators under different workload patterns.

use std::time::Instant;
use std::alloc::{GlobalAlloc, Layout};

fn benchmark_allocator<A: GlobalAlloc>(allocator: &A, iterations: usize) {
    let layout = Layout::from_size_align(64, 8).unwrap();
    let mut ptrs = Vec::with_capacity(iterations);

    // Benchmark allocation
    let alloc_start = Instant::now();
    for _ in 0..iterations {
        let ptr = unsafe { allocator.alloc(layout) };
        if !ptr.is_null() {
            ptrs.push(ptr);
        }
    }
    let alloc_duration = alloc_start.elapsed();

    // Benchmark deallocation
    let dealloc_start = Instant::now();
    for ptr in ptrs {
        unsafe { allocator.dealloc(ptr, layout) };
    }
    let dealloc_duration = dealloc_start.elapsed();

    println!("Allocation: {:?}, Deallocation: {:?}", 
             alloc_duration, dealloc_duration);
}

Real-World Applications and Case Studies

High-frequency trading systems benefit enormously from custom allocators. I've implemented pool allocators for order book management where microsecond latencies matter. The predictable allocation patterns allowed achieving sub-microsecond allocation times.

Game engines represent another domain where custom allocators shine. Frame-based allocation patterns work well with stack allocators, while object pooling systems benefit from specialized pool allocators. The ability to reset entire memory regions between frames eliminates garbage collection pauses.

Embedded systems with limited memory benefit from custom allocators that minimize waste and provide predictable behavior. Fixed-size allocators ensure memory usage remains within strict bounds while providing deterministic allocation performance.

Data processing applications working with large datasets benefit from memory-mapped allocators. Stream processing systems can treat massive files as memory arrays while maintaining excellent performance characteristics.

Custom allocators provide powerful tools for optimizing application performance in specific domains. The key lies in understanding application memory patterns and selecting appropriate allocation strategies. Rust's flexible allocator system enables experimentation with different approaches while maintaining memory safety guarantees.

The investment in custom allocators pays dividends in applications where memory management represents a performance bottleneck. By tailoring allocation strategies to specific workload characteristics, developers can achieve performance improvements that would be impossible with general-purpose allocators.

📘 Checkout my latest ebook for free on my channel!

Be sure to like, share, comment, and subscribe to the channel!

101 Books

101 Books is an AI-driven publishing company co-founded by author Aarav Joshi. By leveraging advanced AI technology, we keep our publishing costs incredibly low—some books are priced as low as $4—making quality knowledge accessible to everyone.

Check out our book Golang Clean Code available on Amazon.

Stay tuned for updates and exciting news. When shopping for books, search for Aarav Joshi to find more of our titles. Use the provided link to enjoy special discounts!

Our Creations

Be sure to check out our creations:

We are on Medium

DEV Community