Netflix · Performance · 17 May 2026
Netflix ran millions of containers per day on modern multi-core CPUs. The containers performed well on benchmarks. In production, under certain workloads, they were mysteriously slower than expected — slower than the hardware should have allowed. The culprit was CPU topology: the operating system was scheduling container workloads in ways that violated modern CPU cache architecture. They called the investigation 'Mount Mayhem.'
- Mount Mayhem investigation
- Modern multi-core CPU topology
- NUMA and cache locality
- CPU pinning experiments
- Container scheduler interaction
- Netflix Titus platform
The Story
Modern CPUs are not the flat, uniform compute resources that operating system schedulers often treat them as. A 96-core server CPU is actually a hierarchical structure: multiple physical processor dies , each containing cores that share L3 cache, connected to memory via NUMA (Non-Uniform Memory Access — a CPU architecture where memory access speed depends on the physical proximity of the memory module to the CPU core accessing it, with local memory being faster than remote memory) domains. Two cores on the same die can share cache and access memory very quickly. Two cores on different dies, or different NUMA nodes, pay a penalty for cross-die communication and remote memory access. When the Linux kernel schedules container workloads across these cores, it doesn't always respect these hardware boundaries — and at Netflix's scale, the performance consequences are measurable and significant.
🔬
Netflix's investigation, internally called Mount Mayhem , was triggered by observing that certain container workloads performed below their expected throughput under production load — despite having sufficient CPU cores allocated. The gap between expected and observed performance pointed to a hardware-level efficiency issue rather than an application-level bug.
Netflix runs its container workloads on Titus (Netflix's in-house container orchestration platform built on top of Apache Mesos, used to run containerized workloads across Netflix's AWS fleet), the company's internal container orchestration platform. Titus allocates CPU cores to containers and manages scheduling across Netflix's fleet of EC2 instances. The Mount Mayhem investigation revealed that even when a container had adequate CPU allocation, the placement of those CPUs across the host machine's CPU topology (the physical and logical organization of CPU cores, cache hierarchies, and memory domains on a multi-socket or multi-die server) could dramatically affect performance. A workload allocated 4 cores on a single NUMA node performs very differently from the same workload allocated 4 cores spread across multiple NUMA nodes — even if the raw core count is identical.
THE CPU TOPOLOGY PROBLEM
Modern high-core-count CPUs (64, 96, 128+ cores) achieve their core counts by connecting multiple physical dies on a package. Each die has its own L3 cache. Cores on the same die share their L3 cache efficiently. Cores on different dies must communicate through slower inter-die interconnects. When an application's threads are scheduled on cores from different dies, they can't efficiently share cached data — triggering expensive cross-die or cross-NUMA memory accesses. The CPU is physically working against itself.
Problem
Container Throughput Below Hardware Expectation
Certain Netflix container workloads showed throughput below what the allocated CPU resources should deliver. Benchmarks on isolated machines showed good performance; production deployments on shared multi-tenant hosts showed degraded performance. The gap suggested a scheduling or placement issue rather than an application bug.
Cause
CPU Scheduling Across NUMA and Die Boundaries
The Linux kernel's CFS scheduler allocates cores without necessarily respecting CPU die and NUMA boundaries. A container allocated 4 cores might receive cores from 2 different physical dies, or 2 different NUMA nodes. Cross-die and cross-NUMA communication is significantly slower than intra-die communication, degrading cache efficiency and increasing memory access latency.
Solution
Topology-Aware Scheduling in Titus
Netflix experimented with CPU pinning and topology-aware scheduling in Titus — allocating cores to containers from the same physical die and NUMA node wherever possible. This kept container workloads' cache working sets on a single die's L3 cache, eliminating cross-die cache misses.
Result
Measurable Throughput Improvement for Topology-Sensitive Workloads
Topology-aware scheduling produced measurable throughput improvements for workloads that were sensitive to CPU cache locality — particularly high-throughput, latency-sensitive services. The investigation produced insights applicable across Netflix's Titus fleet and contributed to best practices for container scheduling on modern multi-die CPUs.
ℹ️
Why Benchmarks Don't Catch This
CPU topology issues are particularly insidious because standard benchmarks often miss them. A benchmark running on a dedicated machine with exclusive CPU allocation naturally gets cores from the same die, because there's no competing workload to fragment the allocation. A production multi-tenant host with many containers running simultaneously may fragment core allocations across dies, exposing the topology sensitivity that benchmarks never saw. Performance testing on dedicated hardware systematically underestimates topology-related performance risks in multi-tenant production.
The L3 cache (the largest on-chip cache memory, shared among all cores within a physical die — typically 32-128MB depending on the CPU — used to reduce expensive main memory accesses by keeping frequently-used data close to the processors) is the critical resource in this story. When threads on different cores access the same data, the L3 cache is where they can find it without going to main memory. But L3 caches are per-die — cores on different physical dies don't share an L3 cache. If a container's threads are split across dies, their shared data must travel through inter-die interconnects (the high-speed but slower-than-L3-cache links connecting multiple physical dies on a modern CPU package — examples include AMD's Infinity Fabric and Intel's Ring Bus), which are significantly slower than intra-die L3 cache access. The result: higher cache miss rates, more main memory accesses, and lower effective throughput — even with the same number of CPU cores available.
🏔️
Why 'Mount Mayhem'?
Netflix has a tradition of giving investigation projects evocative internal names. Mount Mayhem captured the sense of scaling up to a problem that turned out to be architecturally complex — the investigation started as a performance anomaly and revealed deep hardware-software interaction dynamics that required expertise in CPU microarchitecture, Linux kernel scheduling, and distributed systems to fully understand.
⚠️
Multi-Tenant Host Complexity
Netflix's Titus platform runs many containers on each host simultaneously. In a multi-tenant environment, CPU allocation decisions for one container affect the topology available for all other containers on the same host. If Container A is allocated cores 0-3 (all on Die 1), Container B must take cores from Die 2, creating NUMA pressure for B. Topology-aware scheduling must be global — considering all containers on the host simultaneously — not local to each individual container allocation.
ℹ️
The Titus Platform Context
Netflix Titus is the company's internal container orchestration platform, built on Apache Mesos and managing containerized workloads across Netflix's entire AWS fleet. Titus handles resource allocation, scheduling, and lifecycle management for hundreds of thousands of containers running Netflix's services — from the API layer to encoding pipelines to data processing. The Mount Mayhem investigation happened at the Titus scheduling layer: improving how Titus allocates CPU resources to containers on physical hosts.
❌
The Multi-Tenant Fragmentation Effect
In a dedicated single-tenant host, a container requesting 4 cores naturally gets them from a single die because there's no competition for cores. In Netflix's multi-tenant production environment, multiple containers compete for cores on the same host simultaneously. The first container might claim cores 0-3 (Die 1). The second claims 4-7 (still Die 1). The third gets 24-27 — which are on Die 2, forcing cross-die allocation. Multi-tenancy makes topology fragmentation an emergent property that doesn't appear in single-container benchmarks.
HARDWARE TOPOLOGY DOCUMENTATION GAP
One finding from the Mount Mayhem investigation: detailed CPU topology information is not always easily accessible within cloud VMs. AWS EC2 instances expose varying levels of hardware topology information depending on instance type and generation. Netflix's Titus integration reads available topology data (via /proc/cpuinfo, dmidecode, and lscpu) and builds a best-effort topology model — but for some instance types, the model is incomplete. This is an industry gap that cloud providers have been slowly addressing as multi-die CPUs become the norm.
The Fix
Topology-Aware Scheduling in Netflix Titus
The fix for CPU topology violations was implemented at the container scheduler level in Titus. Rather than treating all cores as equivalent, the topology-aware scheduler models the host's CPU topology — which cores belong to which die, which dies share L3 cache, which cores are in which NUMA node — and uses this model to make allocation decisions that minimize cross-die and cross-NUMA placements. For latency-sensitive services, this means preferentially allocating cores from the same die. For batch workloads with lower latency sensitivity, the scheduler can be more flexible with topology to improve overall utilization.
- Die-local — Core allocation strategy for topology-sensitive workloads — placing all container threads on cores from the same physical die to maximize L3 cache sharing
- NUMA-aware — Memory allocation strategy paired with CPU pinning — ensuring memory allocated by a container is on the NUMA node closest to the container's CPU cores
- Measurable — Throughput improvement from topology-aware scheduling for cache-sensitive workloads — the improvement varies by workload type but is consistent for high-throughput services
- Per-workload — Scheduling policy applied based on workload characteristics — not all workloads benefit equally from topology-aware scheduling; policy is tuned per service type
# Simplified topology-aware CPU allocation logic
# Real implementation in Netflix Titus uses Go and integrates with Linux cgroups
class TopologyAwareCPUAllocator:
def __init__ (self, host_topology: CPUTopology):
# host_topology describes: dies, cores per die, NUMA nodes, L3 cache sizes
self.topology = host_topology
def allocate_cores(
self,
num_cores: int,
workload_type: WorkloadType
) -> list[int]:
if workload_type == WorkloadType.LATENCY_SENSITIVE:
# Prefer cores from a single die — maximize L3 cache sharing
return self._allocate_from_single_die(num_cores)
elif workload_type == WorkloadType.BATCH:
# Allow cross-die allocation — prioritize utilization over locality
return self._allocate_for_utilization(num_cores)
else:
# Default: try single die, fall back to NUMA-aware cross-die
cores = self._allocate_from_single_die(num_cores)
if cores is None:
cores = self._allocate_numa_aware(num_cores)
return cores
def _allocate_from_single_die(self, num_cores: int) -> list[int] | None:
for die in self.topology.dies:
available = die.available_cores()
if len(available) >= num_cores:
return available[:num_cores] # all from same die
return None # no single die has enough free cores
def _allocate_numa_aware(self, num_cores: int) -> list[int]:
# Cross-die but NUMA-aware: prefer cores on same NUMA node
# Avoids the worst penalty (cross-NUMA) when cross-die is unavoidable
return self.topology.allocate_numa_local(num_cores)
THE WORKLOAD SENSITIVITY DIMENSION
Not all workloads benefit equally from topology-aware scheduling. High-throughput, low-latency services (API servers, caching layers, media serving) benefit significantly because they have large working sets that benefit from shared L3 cache. Batch processing workloads with large independent tasks often don't benefit — their working sets are too large for L3 cache regardless of core placement. Netflix's Titus scheduler applies topology-aware policies selectively based on workload characteristics rather than universally.
ℹ️
NUMA Memory Allocation Paired with CPU Pinning
CPU topology-aware scheduling alone is insufficient if memory allocation doesn't follow the same locality rules. When container threads run on Die 1 cores but their memory is allocated on the NUMA node associated with Die 2, memory accesses still pay the NUMA penalty. Effective topology-aware scheduling requires paired NUMA-aware memory allocation — using Linux kernel mechanisms like
numactlormbindto bind container memory allocations to the NUMA node local to the allocated CPU cores.✅
Implications for EC2 Instance Selection
The Mount Mayhem investigation also produced insights for EC2 instance type selection. Different instance families have different CPU topologies — some have multiple physical dies with separate L3 caches, others have a single die with shared L3. For topology-sensitive workloads, single-die instances provide predictable cache locality without scheduling complexity. This insight influences Netflix's instance type selection for specific workload classes — the optimal instance for a topology-sensitive service may not be the one with the most raw core count.
✅
LLC Cache Affinity: The Technical Mechanism
The Linux kernel provides CPU affinity controls via cgroups cpuset — the mechanism that restricts which cores a container can use. Netflix's Titus scheduler, when making topology-aware placements, sets the cgroup cpuset to cores on a single die and does not allow the Linux CFS to migrate threads beyond that cpuset. The cpuset becomes the enforcement mechanism: the container's threads can only run on the assigned die, guaranteeing L3 cache locality for the container's lifetime on that host.
⚠️
The Utilization Trade-Off
Topology-aware scheduling improves per-container performance but can reduce overall host utilization. If Die 1 has 12 cores and only 10 are free, a topology-aware scheduler won't place a 4-core container there even though there are cores available — it needs 4 contiguous die-local cores. This leaves cores underutilized to preserve topology quality. Netflix tunes this trade-off per workload class: topology-sensitive services get strict die-local allocation; batch workloads accept cross-die placement in exchange for higher host utilization.
Architecture
Understanding the CPU topology problem requires a brief grounding in modern CPU physical architecture. A 96-core server CPU is not 96 independent identical cores — it's typically 4–8 physical dies, each with 12–24 cores sharing an L3 cache. The dies are connected via a high-speed but non-L3 interconnect. When the operating system schedules threads, it sees 96 logical CPUs and can place threads anywhere. The hardware, however, provides very different performance depending on whether threads land on the same die or different dies.
Modern CPU Physical Topology: Two Dies, Four NUMA Nodes
View interactive diagram on TechLogStack →
Interactive diagram available on TechLogStack (link above).
Container Allocation: Topology-Unaware vs Topology-Aware
View interactive diagram on TechLogStack →
Interactive diagram available on TechLogStack (link above).
LINUX SCHEDULER VS HARDWARE REALITY
The Linux Completely Fair Scheduler (CFS) was designed for a world where CPU cores were equivalent. On modern multi-die CPUs, they are not. CFS's load balancing can migrate threads between cores to equalize load — inadvertently crossing die boundaries and destroying cache locality in the process. Netflix's topology-aware scheduling in Titus works above the CFS level , making initial placement decisions that minimize cross-die allocations, and using CPU affinity pinning to prevent CFS from subsequently migrating threads across die boundaries.
ℹ️
The Cloud Provider Dimension
AWS EC2 instance types vary in their physical CPU topology. Some instances expose underlying hardware topology via DMI/SMBIOS data that can be read from within a VM. Netflix's Titus integration reads this topology information to make informed allocation decisions. Not all instance types expose this information reliably , which constrains topology-aware scheduling to instance families where the physical topology is knowable. This is an area where cloud provider documentation and tooling has been improving as CPU core counts grow.
💡
AMD EPYC: The Multi-Die Extreme Case
The CPU topology problem is particularly pronounced on AMD EPYC processors (widely used in AWS EC2 instances). Modern EPYC chips use AMD's chiplet design, with 8+ physical compute dies (CCDs) per CPU, each with 8 cores sharing 32MB of L3 cache. A 64-core EPYC CPU has 8 separate L3 cache domains. Cross-CCD communication goes through the AMD Infinity Fabric — fast, but not as fast as intra-CCD L3 access. Netflix's Mount Mayhem work was partly motivated by expanding use of EPYC-based instances where the topology effects are most pronounced.
Lessons
Mount Mayhem's lessons bridge hardware microarchitecture and distributed systems — a combination that's rare but important as core counts on modern servers grow into the hundreds.
- 01. Modern CPUs are not flat. 96 cores on one CPU are not 96 equivalent resources — they're organized in a hierarchy of dies, NUMA nodes, and L3 caches that dramatically affect performance depending on how workloads are placed. Container schedulers that treat all cores as equivalent leave performance on the table for topology-sensitive workloads.
- 02. NUMA (Non-Uniform Memory Access — memory access speed depends on the physical proximity of the memory to the CPU core accessing it) effects compound with L3 cache locality effects. Topology-aware scheduling must address both: allocate cores from the same die, and allocate memory from the NUMA node local to those cores. Fixing CPU placement without fixing memory placement leaves half the penalty in place.
- 03. Benchmarks on dedicated hardware systematically underestimate topology-sensitivity. A benchmark with exclusive machine access gets natural die locality. Production multi-tenant hosts fragment core allocations across dies under load. Always validate performance on production-equivalent multi-tenant hosts, not just isolated benchmark environments.
- 04. Topology-aware scheduling should be workload-aware, not universally applied. Latency-sensitive services with shared working sets benefit significantly from die-local core allocation. Batch workloads with large independent tasks often benefit less. Apply topology-aware policies where they deliver measurable improvements and use flexible allocation elsewhere to maximize host utilization.
- 05. CPU affinity pinning prevents the Linux scheduler from undoing topology-aware placement. Without pinning, CFS load balancing can migrate threads across die boundaries to equalize load — destroying the cache locality that topology-aware placement achieved. Topology-aware scheduling requires affinity pinning to be durable under load.
AS CORE COUNTS GROW, THIS MATTERS MORE
The CPU topology problem gets worse as server CPU core counts increase. A 16-core CPU might be a single die. A 96-core CPU is almost certainly multiple dies. A 192-core CPU will have even more complex topology. As cloud providers offer larger instance types and as hardware vendors continue scaling core counts via multi-die packaging, topology-aware scheduling becomes increasingly important for high-performance production workloads.
ℹ️
The Encoding Pipeline Application
Netflix's video encoding pipelines are a particularly topology-sensitive workload. Encoding involves shared codec state across multiple threads working on different sections of the same video. When those threads are split across dies, the shared state must cross the inter-die interconnect on every access. Topology-aware scheduling for encoding workloads produced some of the most measurable improvements in the Mount Mayhem investigation — the shared-working-set nature of encoding makes it naturally sensitive to L3 cache locality.
Netflix's containers were assigned 4 CPU cores and got 4 CPU cores, but two of them had to talk across a hardware bus to share data with the other two, which is the infrastructure equivalent of having a meeting where half the attendees are on speakerphone from another room.
TechLogStack — built at scale, broken in public, rebuilt by engineers
This case is a plain-English retelling of publicly available engineering material.
Read the full case on TechLogStack → (interactive diagrams, source links, and the full reader experience).
Top comments (0)