James Lee

Posted on May 19

Go Heap Memory Allocation: tcmalloc, Mutator/Allocator & Multi-Level Cache

#architecture #computerscience #go #performance

When your Go program creates a struct or a slice, where does that memory actually come from? The answer involves a surprisingly deep stack: escape analysis, a multi-level cache hierarchy, span management, and ultimately mmap calls to the OS. Let's trace the full path.

1. The Core Problem: Bridging Objects and Pages

There's a fundamental gap between how applications think about memory and how the OS provides it:

Application view          Allocator bridge          OS view
─────────────────         ────────────────          ──────────
"I need a struct"    →    maps object ↔ block   →   mmap pages
"I need a slice"          manages size classes       (fixed-size chunks)

The mutator (your application code) thinks in terms of objects. The OS thinks in terms of pages. The allocator sits between them, handling:

Fragmentation — dividing memory into size-classed blocks to minimize waste
Concurrency — reducing lock contention for small, high-frequency allocations
Caching — pre-allocating memory to avoid repeated OS calls

This problem was solved in C/C++ by TCMalloc (Thread-Cache Malloc). Go's allocator is directly inspired by it.

2. Memory Allocation Basics

Three Allocation Regions

Region	Contents	Management
Static storage	Global vars, constants, root objects	Compiler-managed
Stack	Local variables, function frames	Auto (move stack pointer)
Heap	`make`, `new`, escaped objects	Runtime allocator + GC

The stack is fast — allocation is just a pointer increment. The heap is shared across goroutines, which introduces fragmentation and concurrency challenges.

The `brk` Problem

Early allocators used sbrk() to move the heap boundary:

Low address                          High address
┌──────────┬──────────┬─────────────────────────┐
│    A     │    B     │       free space         │  ← brk pointer
└──────────┴──────────┴─────────────────────────┘

If we free A but not B:
→ Can't move brk down (would destroy B)
→ A must be kept as a free cache block
→ Only when both A and B are free can we reclaim the space

This leads to the classic linked-list allocator: each block stores metadata (size, next pointer, in-use flag), and freed blocks are reused rather than returned to the OS.

Problems with a naive linked-list allocator:

Global lock → severe contention under high concurrency
Linear scan to find a free block → O(N) allocation time
Severe fragmentation (a 1GB free block wasted on a 1KB allocation)
Only the tail block can be returned to the OS

3. TCMalloc: The Foundation

TCMalloc solves these problems with three core ideas:

Key Concepts

Term	Definition
Page	Fixed-size memory unit (TCMalloc's page ≥ OS page, always a multiple)
Span / SpanList	A group of contiguous pages; spans of the same page count form a doubly-linked list
Object	A span is split into N equal-sized objects, forming a FreeList

Three-Layer Architecture

┌─────────────────────────────────────────────────────────┐
│                   TCMalloc Architecture                 │
│                                                         │
│  Thread 1        Thread 2        Thread 3               │
│  ┌──────────┐    ┌──────────┐    ┌──────────┐           │
│  │ThreadCache│   │ThreadCache│   │ThreadCache│  ← L1    │
│  │(no lock) │    │(no lock) │    │(no lock) │           │
│  └────┬─────┘    └────┬─────┘    └────┬─────┘           │
│       └───────────────┼───────────────┘                 │
│                  ┌────▼─────┐                           │
│                  │CentralCache│                ← L2     │
│                  │(with lock)│                          │
│                  └────┬─────┘                           │
│                  ┌────▼─────┐                           │
│                  │ PageHeap │                  ← L3     │
│                  │(Span mgmt)│                          │
│                  └──────────┘                           │
└─────────────────────────────────────────────────────────┘

ThreadCache — per-thread, lock-free. Each thread has its own free-list per size class.
CentralCache — shared across threads, requires a lock. Refills ThreadCache when depleted.
PageHeap — manages Spans. Splits a Span into objects for CentralCache when needed.

4. Go's Multi-Level Allocator

Go's allocator is based on TCMalloc with two key differences:

Finer size classification — 67 size classes instead of TCMalloc's coarser buckets
P-local cache instead of thread-local — cache is bound to the logical processor (P), not the OS thread

Object Size Classes

Category	Size Range	Allocator
tiny	< 16 bytes, no pointer	`mcache.tiny` allocator
small	16 bytes – 32 KB	`mspan` via `mcache`
large	> 32 KB	Directly from `mheap`

The Four-Level Cache Hierarchy

┌──────────────────────────────────────────────────────────┐
│              Go Memory Cache Hierarchy                   │
│                                                          │
│  L1: mcache.tiny        ← tiny objects (< 16B, noscan)  │
│  L2: mcache.alloc[]     ← per-P mspan cache (67 classes) │
│  L3: mheap.mcentral[]   ← shared span pool (with lock)  │
│  L4: mheap.arenas[]     ← page-level allocation (64MB)  │
│                ↓                                         │
│           OS (mmap)     ← when L4 is exhausted          │
└──────────────────────────────────────────────────────────┘

This mirrors CPU cache design: fast local access first, falling back to slower shared resources only when necessary.

5. Core Data Structures

`mspan` — The Basic Unit

type mspan struct {
    next      *mspan     // linked list pointer
    prev      *mspan
    startAddr uintptr    // start address of this span
    npages    uintptr    // number of 8KB pages in this span
    spanclass spanClass  // size class + noscan flag
    freeindex uintptr    // index of next free object
    nelems    uintptr    // total number of objects in span
    allocBits *gcBits    // bitmap: 1 = allocated, 0 = free
    gcmarkBits *gcBits   // GC tri-color marking bitmap
}

Key insight: mspan is allocated in whole pages (npages × 8KB), then subdivided into equal-sized objects. Think of it as: mspan = bulk page allocation + fine-grained object dispensing.

`mcache` — Per-P Cache (L1/L2)

type mcache struct {
    tiny       uintptr          // pointer to current 16-byte tiny block
    tinyoffset uintptr          // offset for next tiny allocation
    tinyAllocs uintptr          // number of tiny objects allocated
    alloc [numSpanClasses]*mspan // one mspan per size class (134 total)
}

Each P has its own mcache. Allocations from mcache require no locks.

`mcentral` — Shared Span Pool (L3)

type mcentral struct {
    lock      mutex
    spanclass spanClass
    partial   [2]spanSet  // spans with free space (swept/unswept)
    full      [2]spanSet  // spans with no free space
}

One mcentral per size class (67 × 2 = 134 mcentral instances in total — for scan and noscan variants). Lock is per-mcentral, so different size classes don't contend.

`mheap` — Global Heap Manager (L4)

type mheap struct {
    lock    mutex
    arenas  [1 << arenaL1Bits]*[1 << arenaL2Bits]*heapArena
    central [67*2]struct {
        mcentral mcentral
        pad      [...]byte  // cache line padding
    }
}

mheap is a single global instance. It manages the entire virtual address space via a 2D array of heapArena structs. On Linux x86-64: 1 × 4,194,304 arenas × 64MB = up to 256TB of addressable heap.

`heapArena` — 64MB Memory Block

type heapArena struct {
    spans      [pagesPerArena]*mspan  // one mspan pointer per 8KB page
    zeroedBase uintptr                // base address of this arena
}

6. Virtual Memory Layout

mheap.arenas (2D array)
│
├── heapArena[0]  (64MB)
│   ├── page 0  (8KB) → mspan A
│   ├── page 1  (8KB) → mspan A  (same span, multi-page)
│   ├── page 2  (8KB) → mspan B
│   └── ...
│
├── heapArena[1]  (64MB)
│   └── ...
└── ...

mspan internal layout:
┌─────────────────────────────────────┐
│  page 0  │  page 1  │  page 2      │  ← npages × 8KB
├──────┬───┴──┬────┬──┴───┬──────────┤
│ obj0 │ obj1 │obj2│ obj3 │  ...     │  ← objects (size = spanclass)
└──────┴──────┴────┴──────┴──────────┘

Page management uses a radix tree (pageAlloc) for O(1) free page lookup — evolved from freelist (O(N)) → treap (O(log N)) → radix tree (O(1)).

7. The Allocation Path

Here's what happens when your Go code allocates a 64-byte object:

goroutine on P
     ↓
compiler inserts runtime.newobject()
     ↓
┌─── size < 16B && noscan? ──────────────────┐
│    YES → mcache.tiny allocator             │
│    NO  ↓                                   │
├─── size > 32KB? ───────────────────────────┤
│    YES → mheap direct allocation           │
│    NO  ↓                                   │
└─── 16B–32KB: look up mcache.alloc[sizeclass]
          ↓
     mspan found with free slot?
          ├── YES → mark allocBits, return pointer  ✅ (no lock)
          └── NO  ↓
              mcentral.cacheSpan()  (lock per size class)
                  ↓
              mcentral has available span?
                  ├── YES → return span to mcache
                  └── NO  ↓
                      mheap.alloc()  (global lock)
                          ↓
                      heapArena has free pages?
                          ├── YES → carve out new mspan
                          └── NO  ↓
                              mmap() → OS

8. Escape Analysis: Stack vs Heap

Go's compiler decides at compile time whether a value lives on the stack or the heap using escape analysis.

Rule: if pointer p has a longer lifetime than value v,
      then v must be heap-allocated ("escapes to heap")

Common escape triggers:

Pointer returned from a function
Value passed to an interface
Value captured by a goroutine closure
Size determined dynamically (e.g. make([]int, n) where n is a variable)
Value stored in an already-escaped object (escape is transitive)

Inspect escape analysis:

go build -gcflags="-m" main.go

# Example output:
# main.go:4: make([]int, 10240) escapes to heap

Stack vs Heap Cost

Allocation	Cost	Managed by
Stack	Move stack pointer — essentially free	Compiler
Heap	Multi-level cache lookup + possible lock + possible mmap	Runtime allocator + GC

Implication: Minimizing heap escapes is one of the most effective Go performance optimizations. Use -gcflags="-m" to audit your hot paths.

9. Summary: The Full Picture

Go Object Allocation Flow
─────────────────────────
Source code: var x = &MyStruct{}
     ↓ (compile time)
Escape analysis → stack or heap?
     ↓ (heap path)
runtime.newobject(size)
     ↓
tiny / small / large routing
     ↓
mcache (no lock, P-local)
     ↓ miss
mcentral (per-size-class lock)
     ↓ miss
mheap (global, page-level)
     ↓ miss
mmap → OS

Layer	Scope	Lock	Granularity
`mcache`	Per-P	None	Object (size-classed)
`mcentral`	Per size class	Per-class mutex	mspan
`mheap`	Global	Global mutex	Pages (8KB)
OS (`mmap`)	System	Kernel	Arena (64MB)

Next in this series: Go I/O Optimization: goroutine-per-connection, epoll & the Reader/Writer Interface (Part 3)

Found this useful? Follow the series for more deep dives into Go's runtime internals.

DEV Community

Go Heap Memory Allocation: tcmalloc, Mutator/Allocator & Multi-Level Cache

1. The Core Problem: Bridging Objects and Pages

2. Memory Allocation Basics

Three Allocation Regions

The `brk` Problem

3. TCMalloc: The Foundation

Key Concepts

Three-Layer Architecture

4. Go's Multi-Level Allocator

Object Size Classes

The Four-Level Cache Hierarchy

5. Core Data Structures

`mspan` — The Basic Unit

`mcache` — Per-P Cache (L1/L2)

`mcentral` — Shared Span Pool (L3)

`mheap` — Global Heap Manager (L4)

`heapArena` — 64MB Memory Block

6. Virtual Memory Layout

7. The Allocation Path

8. Escape Analysis: Stack vs Heap

Stack vs Heap Cost

9. Summary: The Full Picture

Top comments (0)

1. The Core Problem: Bridging Objects and Pages

2. Memory Allocation Basics

Three Allocation Regions

The brk Problem

3. TCMalloc: The Foundation

Key Concepts

Three-Layer Architecture

4. Go's Multi-Level Allocator

Object Size Classes

The Four-Level Cache Hierarchy

5. Core Data Structures

mspan — The Basic Unit

mcache — Per-P Cache (L1/L2)

mcentral — Shared Span Pool (L3)

mheap — Global Heap Manager (L4)

heapArena — 64MB Memory Block

6. Virtual Memory Layout

7. The Allocation Path

8. Escape Analysis: Stack vs Heap

Stack vs Heap Cost

9. Summary: The Full Picture

The `brk` Problem

`mspan` — The Basic Unit

`mcache` — Per-P Cache (L1/L2)

`mcentral` — Shared Span Pool (L3)

`mheap` — Global Heap Manager (L4)

`heapArena` — 64MB Memory Block