When your Go program creates a struct or a slice, where does that memory actually come from? The answer involves a surprisingly deep stack: escape analysis, a multi-level cache hierarchy, span management, and ultimately mmap calls to the OS. Let's trace the full path.
1. The Core Problem: Bridging Objects and Pages
There's a fundamental gap between how applications think about memory and how the OS provides it:
Application view Allocator bridge OS view
───────────────── ──────────────── ──────────
"I need a struct" → maps object ↔ block → mmap pages
"I need a slice" manages size classes (fixed-size chunks)
The mutator (your application code) thinks in terms of objects. The OS thinks in terms of pages. The allocator sits between them, handling:
- Fragmentation — dividing memory into size-classed blocks to minimize waste
- Concurrency — reducing lock contention for small, high-frequency allocations
- Caching — pre-allocating memory to avoid repeated OS calls
This problem was solved in C/C++ by TCMalloc (Thread-Cache Malloc). Go's allocator is directly inspired by it.
2. Memory Allocation Basics
Three Allocation Regions
| Region | Contents | Management |
|---|---|---|
| Static storage | Global vars, constants, root objects | Compiler-managed |
| Stack | Local variables, function frames | Auto (move stack pointer) |
| Heap |
make, new, escaped objects |
Runtime allocator + GC |
The stack is fast — allocation is just a pointer increment. The heap is shared across goroutines, which introduces fragmentation and concurrency challenges.
The brk Problem
Early allocators used sbrk() to move the heap boundary:
Low address High address
┌──────────┬──────────┬─────────────────────────┐
│ A │ B │ free space │ ← brk pointer
└──────────┴──────────┴─────────────────────────┘
If we free A but not B:
→ Can't move brk down (would destroy B)
→ A must be kept as a free cache block
→ Only when both A and B are free can we reclaim the space
This leads to the classic linked-list allocator: each block stores metadata (size, next pointer, in-use flag), and freed blocks are reused rather than returned to the OS.
Problems with a naive linked-list allocator:
- Global lock → severe contention under high concurrency
- Linear scan to find a free block → O(N) allocation time
- Severe fragmentation (a 1GB free block wasted on a 1KB allocation)
- Only the tail block can be returned to the OS
3. TCMalloc: The Foundation
TCMalloc solves these problems with three core ideas:
Key Concepts
| Term | Definition |
|---|---|
| Page | Fixed-size memory unit (TCMalloc's page ≥ OS page, always a multiple) |
| Span / SpanList | A group of contiguous pages; spans of the same page count form a doubly-linked list |
| Object | A span is split into N equal-sized objects, forming a FreeList |
Three-Layer Architecture
┌─────────────────────────────────────────────────────────┐
│ TCMalloc Architecture │
│ │
│ Thread 1 Thread 2 Thread 3 │
│ ┌──────────┐ ┌──────────┐ ┌──────────┐ │
│ │ThreadCache│ │ThreadCache│ │ThreadCache│ ← L1 │
│ │(no lock) │ │(no lock) │ │(no lock) │ │
│ └────┬─────┘ └────┬─────┘ └────┬─────┘ │
│ └───────────────┼───────────────┘ │
│ ┌────▼─────┐ │
│ │CentralCache│ ← L2 │
│ │(with lock)│ │
│ └────┬─────┘ │
│ ┌────▼─────┐ │
│ │ PageHeap │ ← L3 │
│ │(Span mgmt)│ │
│ └──────────┘ │
└─────────────────────────────────────────────────────────┘
- ThreadCache — per-thread, lock-free. Each thread has its own free-list per size class.
- CentralCache — shared across threads, requires a lock. Refills ThreadCache when depleted.
- PageHeap — manages Spans. Splits a Span into objects for CentralCache when needed.
4. Go's Multi-Level Allocator
Go's allocator is based on TCMalloc with two key differences:
- Finer size classification — 67 size classes instead of TCMalloc's coarser buckets
- P-local cache instead of thread-local — cache is bound to the logical processor (P), not the OS thread
Object Size Classes
| Category | Size Range | Allocator |
|---|---|---|
| tiny | < 16 bytes, no pointer |
mcache.tiny allocator |
| small | 16 bytes – 32 KB |
mspan via mcache
|
| large | > 32 KB | Directly from mheap
|
The Four-Level Cache Hierarchy
┌──────────────────────────────────────────────────────────┐
│ Go Memory Cache Hierarchy │
│ │
│ L1: mcache.tiny ← tiny objects (< 16B, noscan) │
│ L2: mcache.alloc[] ← per-P mspan cache (67 classes) │
│ L3: mheap.mcentral[] ← shared span pool (with lock) │
│ L4: mheap.arenas[] ← page-level allocation (64MB) │
│ ↓ │
│ OS (mmap) ← when L4 is exhausted │
└──────────────────────────────────────────────────────────┘
This mirrors CPU cache design: fast local access first, falling back to slower shared resources only when necessary.
5. Core Data Structures
mspan — The Basic Unit
type mspan struct {
next *mspan // linked list pointer
prev *mspan
startAddr uintptr // start address of this span
npages uintptr // number of 8KB pages in this span
spanclass spanClass // size class + noscan flag
freeindex uintptr // index of next free object
nelems uintptr // total number of objects in span
allocBits *gcBits // bitmap: 1 = allocated, 0 = free
gcmarkBits *gcBits // GC tri-color marking bitmap
}
Key insight:
mspanis allocated in whole pages (npages × 8KB), then subdivided into equal-sized objects. Think of it as: mspan = bulk page allocation + fine-grained object dispensing.
mcache — Per-P Cache (L1/L2)
type mcache struct {
tiny uintptr // pointer to current 16-byte tiny block
tinyoffset uintptr // offset for next tiny allocation
tinyAllocs uintptr // number of tiny objects allocated
alloc [numSpanClasses]*mspan // one mspan per size class (134 total)
}
Each P has its own mcache. Allocations from mcache require no locks.
mcentral — Shared Span Pool (L3)
type mcentral struct {
lock mutex
spanclass spanClass
partial [2]spanSet // spans with free space (swept/unswept)
full [2]spanSet // spans with no free space
}
One mcentral per size class (67 × 2 = 134 mcentral instances in total — for scan and noscan variants). Lock is per-mcentral, so different size classes don't contend.
mheap — Global Heap Manager (L4)
type mheap struct {
lock mutex
arenas [1 << arenaL1Bits]*[1 << arenaL2Bits]*heapArena
central [67*2]struct {
mcentral mcentral
pad [...]byte // cache line padding
}
}
mheap is a single global instance. It manages the entire virtual address space via a 2D array of heapArena structs. On Linux x86-64: 1 × 4,194,304 arenas × 64MB = up to 256TB of addressable heap.
heapArena — 64MB Memory Block
type heapArena struct {
spans [pagesPerArena]*mspan // one mspan pointer per 8KB page
zeroedBase uintptr // base address of this arena
}
6. Virtual Memory Layout
mheap.arenas (2D array)
│
├── heapArena[0] (64MB)
│ ├── page 0 (8KB) → mspan A
│ ├── page 1 (8KB) → mspan A (same span, multi-page)
│ ├── page 2 (8KB) → mspan B
│ └── ...
│
├── heapArena[1] (64MB)
│ └── ...
└── ...
mspan internal layout:
┌─────────────────────────────────────┐
│ page 0 │ page 1 │ page 2 │ ← npages × 8KB
├──────┬───┴──┬────┬──┴───┬──────────┤
│ obj0 │ obj1 │obj2│ obj3 │ ... │ ← objects (size = spanclass)
└──────┴──────┴────┴──────┴──────────┘
Page management uses a radix tree (pageAlloc) for O(1) free page lookup — evolved from freelist (O(N)) → treap (O(log N)) → radix tree (O(1)).
7. The Allocation Path
Here's what happens when your Go code allocates a 64-byte object:
goroutine on P
↓
compiler inserts runtime.newobject()
↓
┌─── size < 16B && noscan? ──────────────────┐
│ YES → mcache.tiny allocator │
│ NO ↓ │
├─── size > 32KB? ───────────────────────────┤
│ YES → mheap direct allocation │
│ NO ↓ │
└─── 16B–32KB: look up mcache.alloc[sizeclass]
↓
mspan found with free slot?
├── YES → mark allocBits, return pointer ✅ (no lock)
└── NO ↓
mcentral.cacheSpan() (lock per size class)
↓
mcentral has available span?
├── YES → return span to mcache
└── NO ↓
mheap.alloc() (global lock)
↓
heapArena has free pages?
├── YES → carve out new mspan
└── NO ↓
mmap() → OS
8. Escape Analysis: Stack vs Heap
Go's compiler decides at compile time whether a value lives on the stack or the heap using escape analysis.
Rule: if pointer p has a longer lifetime than value v,
then v must be heap-allocated ("escapes to heap")
Common escape triggers:
- Pointer returned from a function
- Value passed to an interface
- Value captured by a goroutine closure
- Size determined dynamically (e.g.
make([]int, n)wherenis a variable) - Value stored in an already-escaped object (escape is transitive)
Inspect escape analysis:
go build -gcflags="-m" main.go
# Example output:
# main.go:4: make([]int, 10240) escapes to heap
Stack vs Heap Cost
| Allocation | Cost | Managed by |
|---|---|---|
| Stack | Move stack pointer — essentially free | Compiler |
| Heap | Multi-level cache lookup + possible lock + possible mmap | Runtime allocator + GC |
Implication: Minimizing heap escapes is one of the most effective Go performance optimizations. Use
-gcflags="-m"to audit your hot paths.
9. Summary: The Full Picture
Go Object Allocation Flow
─────────────────────────
Source code: var x = &MyStruct{}
↓ (compile time)
Escape analysis → stack or heap?
↓ (heap path)
runtime.newobject(size)
↓
tiny / small / large routing
↓
mcache (no lock, P-local)
↓ miss
mcentral (per-size-class lock)
↓ miss
mheap (global, page-level)
↓ miss
mmap → OS
| Layer | Scope | Lock | Granularity |
|---|---|---|---|
mcache |
Per-P | None | Object (size-classed) |
mcentral |
Per size class | Per-class mutex | mspan |
mheap |
Global | Global mutex | Pages (8KB) |
OS (mmap) |
System | Kernel | Arena (64MB) |
Next in this series: Go I/O Optimization: goroutine-per-connection, epoll & the Reader/Writer Interface (Part 3)
Found this useful? Follow the series for more deep dives into Go's runtime internals.
Top comments (0)