DEV Community

James Lee
James Lee

Posted on

Go Heap Memory Allocation: tcmalloc, Mutator/Allocator & Multi-Level Cache

When your Go program creates a struct or a slice, where does that memory actually come from? The answer involves a surprisingly deep stack: escape analysis, a multi-level cache hierarchy, span management, and ultimately mmap calls to the OS. Let's trace the full path.


1. The Core Problem: Bridging Objects and Pages

There's a fundamental gap between how applications think about memory and how the OS provides it:

Application view          Allocator bridge          OS view
─────────────────         ────────────────          ──────────
"I need a struct"    →    maps object ↔ block   →   mmap pages
"I need a slice"          manages size classes       (fixed-size chunks)
Enter fullscreen mode Exit fullscreen mode

The mutator (your application code) thinks in terms of objects. The OS thinks in terms of pages. The allocator sits between them, handling:

  • Fragmentation — dividing memory into size-classed blocks to minimize waste
  • Concurrency — reducing lock contention for small, high-frequency allocations
  • Caching — pre-allocating memory to avoid repeated OS calls

This problem was solved in C/C++ by TCMalloc (Thread-Cache Malloc). Go's allocator is directly inspired by it.


2. Memory Allocation Basics

Three Allocation Regions

Region Contents Management
Static storage Global vars, constants, root objects Compiler-managed
Stack Local variables, function frames Auto (move stack pointer)
Heap make, new, escaped objects Runtime allocator + GC

The stack is fast — allocation is just a pointer increment. The heap is shared across goroutines, which introduces fragmentation and concurrency challenges.

The brk Problem

Early allocators used sbrk() to move the heap boundary:

Low address                          High address
┌──────────┬──────────┬─────────────────────────┐
│    A     │    B     │       free space         │  ← brk pointer
└──────────┴──────────┴─────────────────────────┘

If we free A but not B:
→ Can't move brk down (would destroy B)
→ A must be kept as a free cache block
→ Only when both A and B are free can we reclaim the space
Enter fullscreen mode Exit fullscreen mode

This leads to the classic linked-list allocator: each block stores metadata (size, next pointer, in-use flag), and freed blocks are reused rather than returned to the OS.

Problems with a naive linked-list allocator:

  • Global lock → severe contention under high concurrency
  • Linear scan to find a free block → O(N) allocation time
  • Severe fragmentation (a 1GB free block wasted on a 1KB allocation)
  • Only the tail block can be returned to the OS

3. TCMalloc: The Foundation

TCMalloc solves these problems with three core ideas:

Key Concepts

Term Definition
Page Fixed-size memory unit (TCMalloc's page ≥ OS page, always a multiple)
Span / SpanList A group of contiguous pages; spans of the same page count form a doubly-linked list
Object A span is split into N equal-sized objects, forming a FreeList

Three-Layer Architecture

┌─────────────────────────────────────────────────────────┐
│                   TCMalloc Architecture                 │
│                                                         │
│  Thread 1        Thread 2        Thread 3               │
│  ┌──────────┐    ┌──────────┐    ┌──────────┐           │
│  │ThreadCache│   │ThreadCache│   │ThreadCache│  ← L1    │
│  │(no lock) │    │(no lock) │    │(no lock) │           │
│  └────┬─────┘    └────┬─────┘    └────┬─────┘           │
│       └───────────────┼───────────────┘                 │
│                  ┌────▼─────┐                           │
│                  │CentralCache│                ← L2     │
│                  │(with lock)│                          │
│                  └────┬─────┘                           │
│                  ┌────▼─────┐                           │
│                  │ PageHeap │                  ← L3     │
│                  │(Span mgmt)│                          │
│                  └──────────┘                           │
└─────────────────────────────────────────────────────────┘
Enter fullscreen mode Exit fullscreen mode
  • ThreadCache — per-thread, lock-free. Each thread has its own free-list per size class.
  • CentralCache — shared across threads, requires a lock. Refills ThreadCache when depleted.
  • PageHeap — manages Spans. Splits a Span into objects for CentralCache when needed.

4. Go's Multi-Level Allocator

Go's allocator is based on TCMalloc with two key differences:

  1. Finer size classification — 67 size classes instead of TCMalloc's coarser buckets
  2. P-local cache instead of thread-local — cache is bound to the logical processor (P), not the OS thread

Object Size Classes

Category Size Range Allocator
tiny < 16 bytes, no pointer mcache.tiny allocator
small 16 bytes – 32 KB mspan via mcache
large > 32 KB Directly from mheap

The Four-Level Cache Hierarchy

┌──────────────────────────────────────────────────────────┐
│              Go Memory Cache Hierarchy                   │
│                                                          │
│  L1: mcache.tiny        ← tiny objects (< 16B, noscan)  │
│  L2: mcache.alloc[]     ← per-P mspan cache (67 classes) │
│  L3: mheap.mcentral[]   ← shared span pool (with lock)  │
│  L4: mheap.arenas[]     ← page-level allocation (64MB)  │
│                ↓                                         │
│           OS (mmap)     ← when L4 is exhausted          │
└──────────────────────────────────────────────────────────┘
Enter fullscreen mode Exit fullscreen mode

This mirrors CPU cache design: fast local access first, falling back to slower shared resources only when necessary.


5. Core Data Structures

mspan — The Basic Unit

type mspan struct {
    next      *mspan     // linked list pointer
    prev      *mspan
    startAddr uintptr    // start address of this span
    npages    uintptr    // number of 8KB pages in this span
    spanclass spanClass  // size class + noscan flag
    freeindex uintptr    // index of next free object
    nelems    uintptr    // total number of objects in span
    allocBits *gcBits    // bitmap: 1 = allocated, 0 = free
    gcmarkBits *gcBits   // GC tri-color marking bitmap
}
Enter fullscreen mode Exit fullscreen mode

Key insight: mspan is allocated in whole pages (npages × 8KB), then subdivided into equal-sized objects. Think of it as: mspan = bulk page allocation + fine-grained object dispensing.

mcache — Per-P Cache (L1/L2)

type mcache struct {
    tiny       uintptr          // pointer to current 16-byte tiny block
    tinyoffset uintptr          // offset for next tiny allocation
    tinyAllocs uintptr          // number of tiny objects allocated
    alloc [numSpanClasses]*mspan // one mspan per size class (134 total)
}
Enter fullscreen mode Exit fullscreen mode

Each P has its own mcache. Allocations from mcache require no locks.

mcentral — Shared Span Pool (L3)

type mcentral struct {
    lock      mutex
    spanclass spanClass
    partial   [2]spanSet  // spans with free space (swept/unswept)
    full      [2]spanSet  // spans with no free space
}
Enter fullscreen mode Exit fullscreen mode

One mcentral per size class (67 × 2 = 134 mcentral instances in total — for scan and noscan variants). Lock is per-mcentral, so different size classes don't contend.

mheap — Global Heap Manager (L4)

type mheap struct {
    lock    mutex
    arenas  [1 << arenaL1Bits]*[1 << arenaL2Bits]*heapArena
    central [67*2]struct {
        mcentral mcentral
        pad      [...]byte  // cache line padding
    }
}
Enter fullscreen mode Exit fullscreen mode

mheap is a single global instance. It manages the entire virtual address space via a 2D array of heapArena structs. On Linux x86-64: 1 × 4,194,304 arenas × 64MB = up to 256TB of addressable heap.

heapArena — 64MB Memory Block

type heapArena struct {
    spans      [pagesPerArena]*mspan  // one mspan pointer per 8KB page
    zeroedBase uintptr                // base address of this arena
}
Enter fullscreen mode Exit fullscreen mode

6. Virtual Memory Layout

mheap.arenas (2D array)
│
├── heapArena[0]  (64MB)
│   ├── page 0  (8KB) → mspan A
│   ├── page 1  (8KB) → mspan A  (same span, multi-page)
│   ├── page 2  (8KB) → mspan B
│   └── ...
│
├── heapArena[1]  (64MB)
│   └── ...
└── ...

mspan internal layout:
┌─────────────────────────────────────┐
│  page 0  │  page 1  │  page 2      │  ← npages × 8KB
├──────┬───┴──┬────┬──┴───┬──────────┤
│ obj0 │ obj1 │obj2│ obj3 │  ...     │  ← objects (size = spanclass)
└──────┴──────┴────┴──────┴──────────┘
Enter fullscreen mode Exit fullscreen mode

Page management uses a radix tree (pageAlloc) for O(1) free page lookup — evolved from freelist (O(N)) → treap (O(log N)) → radix tree (O(1)).


7. The Allocation Path

Here's what happens when your Go code allocates a 64-byte object:

goroutine on P
     ↓
compiler inserts runtime.newobject()
     ↓
┌─── size < 16B && noscan? ──────────────────┐
│    YES → mcache.tiny allocator             │
│    NO  ↓                                   │
├─── size > 32KB? ───────────────────────────┤
│    YES → mheap direct allocation           │
│    NO  ↓                                   │
└─── 16B–32KB: look up mcache.alloc[sizeclass]
          ↓
     mspan found with free slot?
          ├── YES → mark allocBits, return pointer  ✅ (no lock)
          └── NO  ↓
              mcentral.cacheSpan()  (lock per size class)
                  ↓
              mcentral has available span?
                  ├── YES → return span to mcache
                  └── NO  ↓
                      mheap.alloc()  (global lock)
                          ↓
                      heapArena has free pages?
                          ├── YES → carve out new mspan
                          └── NO  ↓
                              mmap() → OS
Enter fullscreen mode Exit fullscreen mode

8. Escape Analysis: Stack vs Heap

Go's compiler decides at compile time whether a value lives on the stack or the heap using escape analysis.

Rule: if pointer p has a longer lifetime than value v,
      then v must be heap-allocated ("escapes to heap")
Enter fullscreen mode Exit fullscreen mode

Common escape triggers:

  • Pointer returned from a function
  • Value passed to an interface
  • Value captured by a goroutine closure
  • Size determined dynamically (e.g. make([]int, n) where n is a variable)
  • Value stored in an already-escaped object (escape is transitive)

Inspect escape analysis:

go build -gcflags="-m" main.go

# Example output:
# main.go:4: make([]int, 10240) escapes to heap
Enter fullscreen mode Exit fullscreen mode

Stack vs Heap Cost

Allocation Cost Managed by
Stack Move stack pointer — essentially free Compiler
Heap Multi-level cache lookup + possible lock + possible mmap Runtime allocator + GC

Implication: Minimizing heap escapes is one of the most effective Go performance optimizations. Use -gcflags="-m" to audit your hot paths.


9. Summary: The Full Picture

Go Object Allocation Flow
─────────────────────────
Source code: var x = &MyStruct{}
     ↓ (compile time)
Escape analysis → stack or heap?
     ↓ (heap path)
runtime.newobject(size)
     ↓
tiny / small / large routing
     ↓
mcache (no lock, P-local)
     ↓ miss
mcentral (per-size-class lock)
     ↓ miss
mheap (global, page-level)
     ↓ miss
mmap → OS
Enter fullscreen mode Exit fullscreen mode
Layer Scope Lock Granularity
mcache Per-P None Object (size-classed)
mcentral Per size class Per-class mutex mspan
mheap Global Global mutex Pages (8KB)
OS (mmap) System Kernel Arena (64MB)

Next in this series: Go I/O Optimization: goroutine-per-connection, epoll & the Reader/Writer Interface (Part 3)


Found this useful? Follow the series for more deep dives into Go's runtime internals.

Top comments (0)