- Mount Mayhem — Netflix's internal investigation into container throughput below hardware expectation
- CPU topology — the real culprit: cores allocated from different physical dies can't share L3 cache
- NUMA effects — memory access speed depends on physical proximity; cross-NUMA is measurably slower
- Multi-tenant fragmentation — containers compete for cores; topology violations emerge under shared load
- Benchmarks miss it — dedicated hardware gives natural die locality; production multi-tenant hosts don't
- Fix: topology-aware scheduling in Netflix Titus — allocate die-local cores, pin with cgroup cpuset
Netflix ran millions of containers per day on modern multi-core CPUs. The containers performed well on benchmarks. In production, under certain workloads, they were mysteriously slower than expected — slower than the hardware should have allowed. The culprit was CPU topology: the operating system was scheduling container workloads in ways that violated modern CPU cache architecture. They called the investigation 'Mount Mayhem.'
The Story
Modern CPUs are not the flat, uniform compute resources that operating system schedulers often treat them as. A 96-core server CPU is actually a hierarchical structure: multiple physical processor dies, each containing cores that share L3 cache, connected to memory via NUMA (Non-Uniform Memory Access — a CPU architecture where memory access speed depends on the physical proximity of the memory module to the CPU core accessing it, with local memory being faster than remote memory) domains. Two cores on the same die can share cache and access memory very quickly. Two cores on different dies, or different NUMA nodes, pay a penalty for cross-die communication and remote memory access. When the Linux kernel schedules container workloads across these cores, it doesn't always respect these hardware boundaries — and at Netflix's scale, the performance consequences are measurable and significant.
Netflix runs its container workloads on Titus (Netflix's in-house container orchestration platform built on top of Apache Mesos, used to run containerised workloads across Netflix's AWS fleet). The Mount Mayhem investigation revealed that even when a container had adequate CPU allocation, the placement of those CPUs across the host machine's CPU topology (the physical and logical organisation of CPU cores, cache hierarchies, and memory domains on a multi-socket or multi-die server) could dramatically affect performance. A workload allocated 4 cores on a single NUMA node performs very differently from the same workload allocated 4 cores spread across multiple NUMA nodes — even if the raw core count is identical.
Problem
Container Throughput Below Hardware Expectation
Certain Netflix container workloads showed throughput below what the allocated CPU resources should deliver. Benchmarks on isolated machines showed good performance; production deployments on shared multi-tenant hosts showed degraded performance. The gap suggested a scheduling or placement issue rather than an application bug.
Cause
CPU Scheduling Across NUMA and Die Boundaries
The Linux kernel's CFS scheduler allocates cores without necessarily respecting CPU die and NUMA boundaries. A container allocated 4 cores might receive cores from 2 different physical dies, or 2 different NUMA nodes. Cross-die and cross-NUMA communication is significantly slower than intra-die communication, degrading cache efficiency and increasing memory access latency.
Solution
Topology-Aware Scheduling in Titus
Netflix experimented with CPU pinning and topology-aware scheduling in Titus — allocating cores to containers from the same physical die and NUMA node wherever possible. This kept container workloads' cache working sets on a single die's L3 cache, eliminating cross-die cache misses.
Result
Measurable Throughput Improvement for Topology-Sensitive Workloads
Topology-aware scheduling produced measurable throughput improvements for workloads sensitive to CPU cache locality — particularly high-throughput, latency-sensitive services. The investigation produced insights applicable across Netflix's Titus fleet.
The Fix
Topology-Aware Scheduling in Netflix Titus
The fix was implemented at the container scheduler level in Titus. Rather than treating all cores as equivalent, the topology-aware scheduler models the host's CPU topology — which cores belong to which die, which dies share L3 cache, which cores are in which NUMA node — and uses this model to make allocation decisions that minimise cross-die and cross-NUMA placements.
- Die-local — core allocation strategy for topology-sensitive workloads; all container threads on cores from the same physical die to maximise L3 cache sharing
- NUMA-aware — memory allocation paired with CPU pinning; container memory allocated on the NUMA node closest to its CPU cores
- Per-workload — scheduling policy applied based on workload characteristics; not all workloads benefit equally
- cpuset pinning — cgroup cpuset used to prevent Linux CFS from migrating threads across die boundaries after placement
# Simplified topology-aware CPU allocation logic
# Real implementation in Netflix Titus uses Go and integrates with Linux cgroups
class TopologyAwareCPUAllocator:
def __init__(self, host_topology: CPUTopology):
# host_topology describes: dies, cores per die, NUMA nodes, L3 cache sizes
self.topology = host_topology
def allocate_cores(
self,
num_cores: int,
workload_type: WorkloadType
) -> list[int]:
if workload_type == WorkloadType.LATENCY_SENSITIVE:
# Prefer cores from a single die — maximise L3 cache sharing
# High-throughput API servers, caching layers, media serving
return self._allocate_from_single_die(num_cores)
elif workload_type == WorkloadType.BATCH:
# Allow cross-die allocation — prioritise utilisation over locality
# Large independent tasks; working sets exceed L3 cache anyway
return self._allocate_for_utilization(num_cores)
else:
# Default: try single die, fall back to NUMA-aware cross-die
cores = self._allocate_from_single_die(num_cores)
if cores is None:
# No single die has enough free cores
# Cross-die but stay NUMA-local to avoid the worst penalty
cores = self._allocate_numa_aware(num_cores)
return cores
def _allocate_from_single_die(self, num_cores: int) -> list[int] | None:
for die in self.topology.dies:
available = die.available_cores()
if len(available) >= num_cores:
# All cores from same die — L3 cache shared, no inter-die penalty
return available[:num_cores]
return None # no single die has enough; caller falls back to NUMA-aware
def _allocate_numa_aware(self, num_cores: int) -> list[int]:
# Cross-die but NUMA-aware: prefer cores on same NUMA node
# Avoids cross-NUMA memory penalty when cross-die is unavoidable
return self.topology.allocate_numa_local(num_cores)
The L3 cache (the largest on-chip cache memory, shared among all cores within a physical die — typically 32–128MB — used to reduce expensive main memory accesses) is the critical resource in this story. When threads on different cores access the same data, the L3 cache is where they can find it without going to main memory. But L3 caches are per-die — cores on different physical dies don't share one. If a container's threads are split across dies, their shared data must travel through inter-die interconnects, which are significantly slower than intra-die L3 cache access.
Multi-tenant fragmentation: why this only appears in production
In a dedicated single-tenant host, a container requesting 4 cores naturally gets them from a single die because there's no competition for cores. In Netflix's multi-tenant production environment, multiple containers compete for cores on the same host simultaneously. The first container might claim cores 0–3 (Die 1). The second claims 4–7 (still Die 1). The third gets 24–27 — which are on Die 2, forcing cross-die allocation. Multi-tenancy makes topology fragmentation an emergent property that doesn't appear in single-container benchmarks.
NUMA memory allocation paired with CPU pinning
CPU topology-aware scheduling alone is insufficient if memory allocation doesn't follow the same locality rules. When container threads run on Die 1 cores but their memory is allocated on the NUMA node associated with Die 2, memory accesses still pay the NUMA penalty. Effective topology-aware scheduling requires paired NUMA-aware memory allocation — using Linux kernel mechanisms like numactl or mbind to bind container memory allocations to the NUMA node local to the allocated CPU cores.
The utilisation trade-off
Topology-aware scheduling improves per-container performance but can reduce overall host utilisation. If Die 1 has 12 cores and only 10 are free, a topology-aware scheduler won't place a 4-core container there even though there are cores available — it needs 4 die-local cores. This leaves cores underutilised to preserve topology quality. Netflix tunes this trade-off per workload class: topology-sensitive services get strict die-local allocation; batch workloads accept cross-die placement in exchange for higher host utilisation.
Architecture
Understanding the CPU topology problem requires a brief grounding in modern CPU physical architecture. A 96-core server CPU is typically 4–8 physical dies, each with 12–24 cores sharing an L3 cache. The dies are connected via a high-speed but non-L3 interconnect. When the operating system schedules threads, it sees 96 logical CPUs and can place threads anywhere. The hardware provides very different performance depending on whether threads land on the same die or different dies.
Modern CPU Physical Topology: Two Dies, Four NUMA Nodes
View interactive diagram on TechLogStack →
Interactive diagram available on TechLogStack (link above).
Container Allocation: Topology-Unaware vs Topology-Aware
View interactive diagram on TechLogStack →
Interactive diagram available on TechLogStack (link above).
Lessons
Modern CPUs are not flat. 96 cores on one CPU are not 96 equivalent resources — they're organised in a hierarchy of dies, NUMA nodes, and L3 caches that dramatically affect performance depending on how workloads are placed. Container schedulers that treat all cores as equivalent leave performance on the table for topology-sensitive workloads.
NUMA effects compound with L3 cache locality effects. Topology-aware scheduling must address both: allocate cores from the same die, and allocate memory from the NUMA node local to those cores. Fixing CPU placement without fixing memory placement leaves half the penalty in place.
Benchmarks on dedicated hardware systematically underestimate topology-sensitivity. A benchmark with exclusive machine access gets natural die locality. Production multi-tenant hosts fragment core allocations across dies under load. Always validate performance on production-equivalent multi-tenant hosts, not just isolated benchmark environments.
Topology-aware scheduling should be workload-aware, not universally applied. Latency-sensitive services with shared working sets benefit significantly from die-local core allocation. Batch workloads with large independent tasks often benefit less. Apply topology-aware policies where they deliver measurable improvements and use flexible allocation elsewhere to maximise host utilisation.
CPU affinity pinning prevents the Linux scheduler from undoing topology-aware placement. Without pinning, CFS load balancing can migrate threads across die boundaries to equalise load — destroying the cache locality that topology-aware placement achieved. Topology-aware scheduling requires affinity pinning to be durable under load.
Engineering Glossary
cgroup cpuset — a Linux kernel mechanism that restricts which CPU cores a process or container can use. Netflix's Titus topology-aware scheduler sets the cpuset to cores on a single die and uses it as an enforcement mechanism to prevent CFS from migrating threads across die boundaries.
CFS (Completely Fair Scheduler) — the default Linux kernel thread scheduler, designed for a world where CPU cores are equivalent. On modern multi-die CPUs, CFS can migrate threads across die boundaries while load balancing, inadvertently destroying cache locality.
CPU topology — the physical and logical organisation of CPU cores, cache hierarchies, and memory domains on a multi-socket or multi-die server. Includes: which cores share which L3 cache, which dies are on which NUMA nodes, and which interconnects link the dies.
Inter-die interconnect — the high-speed but non-L3-cache link connecting multiple physical dies on a modern CPU package. Slower than intra-die L3 cache access. Examples: AMD Infinity Fabric, Intel Ring Bus. The performance gap between intra-die and cross-die access is the root of the topology problem.
L3 cache — the largest on-chip cache memory, shared among all cores within a physical die (typically 32–128MB). Reduces expensive main memory accesses. Per-die — cores on different dies don't share L3 cache, meaning cross-die thread communication requires inter-die interconnect or main memory.
NUMA (Non-Uniform Memory Access) — a CPU architecture where memory access speed depends on the physical proximity of the memory module to the CPU core accessing it. Local memory (on the same NUMA node as the core) is faster than remote memory (on a different NUMA node). Must be addressed alongside CPU topology for full performance improvement.
Titus — Netflix's in-house container orchestration platform, built on Apache Mesos, managing containerised workloads across Netflix's AWS fleet. The Mount Mayhem topology-aware scheduling improvements were implemented at the Titus scheduling layer.
This case is a plain-English retelling of publicly available engineering material.
Read the full case on TechLogStack →
(Interactive diagrams, source links, and the full reader experience)
TechLogStack — built at scale, broken in public, rebuilt by engineers.
Top comments (0)