TechLogStack

Posted on May 20 • Originally published at techlogstack.com on May 17

Netflix's Containers Were Fighting Their Own CPUs — and Losing

#performance #backend #devops #webdev

Mount Mayhem — Netflix's internal investigation into container throughput below hardware expectation
CPU topology — the real culprit: cores allocated from different physical dies can't share L3 cache
NUMA effects — memory access speed depends on physical proximity; cross-NUMA is measurably slower
Multi-tenant fragmentation — containers compete for cores; topology violations emerge under shared load
Benchmarks miss it — dedicated hardware gives natural die locality; production multi-tenant hosts don't
Fix: topology-aware scheduling in Netflix Titus — allocate die-local cores, pin with cgroup cpuset

Netflix ran millions of containers per day on modern multi-core CPUs. The containers performed well on benchmarks. In production, under certain workloads, they were mysteriously slower than expected — slower than the hardware should have allowed. The culprit was CPU topology: the operating system was scheduling container workloads in ways that violated modern CPU cache architecture. They called the investigation 'Mount Mayhem.'

The Story

Modern CPUs are not the flat, uniform compute resources that operating system schedulers often treat them as. A 96-core server CPU is actually a hierarchical structure: multiple physical processor dies, each containing cores that share L3 cache, connected to memory via NUMA (Non-Uniform Memory Access — a CPU architecture where memory access speed depends on the physical proximity of the memory module to the CPU core accessing it, with local memory being faster than remote memory) domains. Two cores on the same die can share cache and access memory very quickly. Two cores on different dies, or different NUMA nodes, pay a penalty for cross-die communication and remote memory access. When the Linux kernel schedules container workloads across these cores, it doesn't always respect these hardware boundaries — and at Netflix's scale, the performance consequences are measurable and significant.

Netflix runs its container workloads on Titus (Netflix's in-house container orchestration platform built on top of Apache Mesos, used to run containerised workloads across Netflix's AWS fleet). The Mount Mayhem investigation revealed that even when a container had adequate CPU allocation, the placement of those CPUs across the host machine's CPU topology (the physical and logical organisation of CPU cores, cache hierarchies, and memory domains on a multi-socket or multi-die server) could dramatically affect performance. A workload allocated 4 cores on a single NUMA node performs very differently from the same workload allocated 4 cores spread across multiple NUMA nodes — even if the raw core count is identical.

The CPU Topology Problem

Modern high-core-count CPUs (64, 96, 128+ cores) achieve their core counts by connecting multiple physical dies on a package. Each die has its own L3 cache. Cores on the same die share their L3 cache efficiently. Cores on different dies must communicate through slower inter-die interconnects (the high-speed but slower-than-L3-cache links connecting multiple physical dies on a modern CPU package — examples include AMD's Infinity Fabric and Intel's Ring Bus), which are significantly slower than intra-die L3 cache access. The result: higher cache miss rates, more main memory accesses, and lower effective throughput — even with the same number of CPU cores available. The CPU is physically working against itself.

Problem

Container Throughput Below Hardware Expectation

Certain Netflix container workloads showed throughput below what the allocated CPU resources should deliver. Benchmarks on isolated machines showed good performance; production deployments on shared multi-tenant hosts showed degraded performance. The gap suggested a scheduling or placement issue rather than an application bug.

Cause

CPU Scheduling Across NUMA and Die Boundaries

The Linux kernel's CFS scheduler allocates cores without necessarily respecting CPU die and NUMA boundaries. A container allocated 4 cores might receive cores from 2 different physical dies, or 2 different NUMA nodes. Cross-die and cross-NUMA communication is significantly slower than intra-die communication, degrading cache efficiency and increasing memory access latency.

Solution

Topology-Aware Scheduling in Titus

Netflix experimented with CPU pinning and topology-aware scheduling in Titus — allocating cores to containers from the same physical die and NUMA node wherever possible. This kept container workloads' cache working sets on a single die's L3 cache, eliminating cross-die cache misses.

Result

Measurable Throughput Improvement for Topology-Sensitive Workloads

Topology-aware scheduling produced measurable throughput improvements for workloads sensitive to CPU cache locality — particularly high-throughput, latency-sensitive services. The investigation produced insights applicable across Netflix's Titus fleet.

The Fix

Topology-Aware Scheduling in Netflix Titus

The fix was implemented at the container scheduler level in Titus. Rather than treating all cores as equivalent, the topology-aware scheduler models the host's CPU topology — which cores belong to which die, which dies share L3 cache, which cores are in which NUMA node — and uses this model to make allocation decisions that minimise cross-die and cross-NUMA placements.

Die-local — core allocation strategy for topology-sensitive workloads; all container threads on cores from the same physical die to maximise L3 cache sharing
NUMA-aware — memory allocation paired with CPU pinning; container memory allocated on the NUMA node closest to its CPU cores
Per-workload — scheduling policy applied based on workload characteristics; not all workloads benefit equally
cpuset pinning — cgroup cpuset used to prevent Linux CFS from migrating threads across die boundaries after placement

# Simplified topology-aware CPU allocation logic
# Real implementation in Netflix Titus uses Go and integrates with Linux cgroups

class TopologyAwareCPUAllocator:
    def __init__(self, host_topology: CPUTopology):
        # host_topology describes: dies, cores per die, NUMA nodes, L3 cache sizes
        self.topology = host_topology

    def allocate_cores(
        self,
        num_cores: int,
        workload_type: WorkloadType
    ) -> list[int]:
        if workload_type == WorkloadType.LATENCY_SENSITIVE:
            # Prefer cores from a single die — maximise L3 cache sharing
            # High-throughput API servers, caching layers, media serving
            return self._allocate_from_single_die(num_cores)

        elif workload_type == WorkloadType.BATCH:
            # Allow cross-die allocation — prioritise utilisation over locality
            # Large independent tasks; working sets exceed L3 cache anyway
            return self._allocate_for_utilization(num_cores)

        else:
            # Default: try single die, fall back to NUMA-aware cross-die
            cores = self._allocate_from_single_die(num_cores)
            if cores is None:
                # No single die has enough free cores
                # Cross-die but stay NUMA-local to avoid the worst penalty
                cores = self._allocate_numa_aware(num_cores)
            return cores

    def _allocate_from_single_die(self, num_cores: int) -> list[int] | None:
        for die in self.topology.dies:
            available = die.available_cores()
            if len(available) >= num_cores:
                # All cores from same die — L3 cache shared, no inter-die penalty
                return available[:num_cores]
        return None  # no single die has enough; caller falls back to NUMA-aware

    def _allocate_numa_aware(self, num_cores: int) -> list[int]:
        # Cross-die but NUMA-aware: prefer cores on same NUMA node
        # Avoids cross-NUMA memory penalty when cross-die is unavoidable
        return self.topology.allocate_numa_local(num_cores)

Why Benchmarks Don't Catch This

CPU topology issues are particularly insidious because standard benchmarks miss them. A benchmark running on a dedicated machine with exclusive CPU allocation naturally gets cores from the same die, because there's no competing workload to fragment the allocation. A production multi-tenant host with many containers running simultaneously may fragment core allocations across dies under load. Performance testing on dedicated hardware systematically underestimates topology-related performance risks in multi-tenant production.

The L3 cache (the largest on-chip cache memory, shared among all cores within a physical die — typically 32–128MB — used to reduce expensive main memory accesses) is the critical resource in this story. When threads on different cores access the same data, the L3 cache is where they can find it without going to main memory. But L3 caches are per-die — cores on different physical dies don't share one. If a container's threads are split across dies, their shared data must travel through inter-die interconnects, which are significantly slower than intra-die L3 cache access.

Multi-tenant fragmentation: why this only appears in production

In a dedicated single-tenant host, a container requesting 4 cores naturally gets them from a single die because there's no competition for cores. In Netflix's multi-tenant production environment, multiple containers compete for cores on the same host simultaneously. The first container might claim cores 0–3 (Die 1). The second claims 4–7 (still Die 1). The third gets 24–27 — which are on Die 2, forcing cross-die allocation. Multi-tenancy makes topology fragmentation an emergent property that doesn't appear in single-container benchmarks.

NUMA memory allocation paired with CPU pinning

CPU topology-aware scheduling alone is insufficient if memory allocation doesn't follow the same locality rules. When container threads run on Die 1 cores but their memory is allocated on the NUMA node associated with Die 2, memory accesses still pay the NUMA penalty. Effective topology-aware scheduling requires paired NUMA-aware memory allocation — using Linux kernel mechanisms like numactl or mbind to bind container memory allocations to the NUMA node local to the allocated CPU cores.

The utilisation trade-off

Topology-aware scheduling improves per-container performance but can reduce overall host utilisation. If Die 1 has 12 cores and only 10 are free, a topology-aware scheduler won't place a 4-core container there even though there are cores available — it needs 4 die-local cores. This leaves cores underutilised to preserve topology quality. Netflix tunes this trade-off per workload class: topology-sensitive services get strict die-local allocation; batch workloads accept cross-die placement in exchange for higher host utilisation.

Architecture

Understanding the CPU topology problem requires a brief grounding in modern CPU physical architecture. A 96-core server CPU is typically 4–8 physical dies, each with 12–24 cores sharing an L3 cache. The dies are connected via a high-speed but non-L3 interconnect. When the operating system schedules threads, it sees 96 logical CPUs and can place threads anywhere. The hardware provides very different performance depending on whether threads land on the same die or different dies.

Modern CPU Physical Topology: Two Dies, Four NUMA Nodes

View interactive diagram on TechLogStack →

Interactive diagram available on TechLogStack (link above).

Container Allocation: Topology-Unaware vs Topology-Aware

View interactive diagram on TechLogStack →

Interactive diagram available on TechLogStack (link above).

Linux Scheduler vs Hardware Reality

The Linux Completely Fair Scheduler (CFS) was designed for a world where CPU cores were equivalent. On modern multi-die CPUs, they are not. CFS's load balancing can migrate threads between cores to equalise load — inadvertently crossing die boundaries and destroying cache locality in the process. Netflix's topology-aware scheduling in Titus works above the CFS level, making initial placement decisions that minimise cross-die allocations, and using CPU affinity pinning (via cgroup cpuset) to prevent CFS from subsequently migrating threads across die boundaries.

Lessons

Modern CPUs are not flat. 96 cores on one CPU are not 96 equivalent resources — they're organised in a hierarchy of dies, NUMA nodes, and L3 caches that dramatically affect performance depending on how workloads are placed. Container schedulers that treat all cores as equivalent leave performance on the table for topology-sensitive workloads.
NUMA effects compound with L3 cache locality effects. Topology-aware scheduling must address both: allocate cores from the same die, and allocate memory from the NUMA node local to those cores. Fixing CPU placement without fixing memory placement leaves half the penalty in place.
Benchmarks on dedicated hardware systematically underestimate topology-sensitivity. A benchmark with exclusive machine access gets natural die locality. Production multi-tenant hosts fragment core allocations across dies under load. Always validate performance on production-equivalent multi-tenant hosts, not just isolated benchmark environments.
Topology-aware scheduling should be workload-aware, not universally applied. Latency-sensitive services with shared working sets benefit significantly from die-local core allocation. Batch workloads with large independent tasks often benefit less. Apply topology-aware policies where they deliver measurable improvements and use flexible allocation elsewhere to maximise host utilisation.
CPU affinity pinning prevents the Linux scheduler from undoing topology-aware placement. Without pinning, CFS load balancing can migrate threads across die boundaries to equalise load — destroying the cache locality that topology-aware placement achieved. Topology-aware scheduling requires affinity pinning to be durable under load.

Engineering Glossary

cgroup cpuset — a Linux kernel mechanism that restricts which CPU cores a process or container can use. Netflix's Titus topology-aware scheduler sets the cpuset to cores on a single die and uses it as an enforcement mechanism to prevent CFS from migrating threads across die boundaries.

CFS (Completely Fair Scheduler) — the default Linux kernel thread scheduler, designed for a world where CPU cores are equivalent. On modern multi-die CPUs, CFS can migrate threads across die boundaries while load balancing, inadvertently destroying cache locality.

CPU topology — the physical and logical organisation of CPU cores, cache hierarchies, and memory domains on a multi-socket or multi-die server. Includes: which cores share which L3 cache, which dies are on which NUMA nodes, and which interconnects link the dies.

Inter-die interconnect — the high-speed but non-L3-cache link connecting multiple physical dies on a modern CPU package. Slower than intra-die L3 cache access. Examples: AMD Infinity Fabric, Intel Ring Bus. The performance gap between intra-die and cross-die access is the root of the topology problem.

L3 cache — the largest on-chip cache memory, shared among all cores within a physical die (typically 32–128MB). Reduces expensive main memory accesses. Per-die — cores on different dies don't share L3 cache, meaning cross-die thread communication requires inter-die interconnect or main memory.

NUMA (Non-Uniform Memory Access) — a CPU architecture where memory access speed depends on the physical proximity of the memory module to the CPU core accessing it. Local memory (on the same NUMA node as the core) is faster than remote memory (on a different NUMA node). Must be addressed alongside CPU topology for full performance improvement.

Titus — Netflix's in-house container orchestration platform, built on Apache Mesos, managing containerised workloads across Netflix's AWS fleet. The Mount Mayhem topology-aware scheduling improvements were implemented at the Titus scheduling layer.

This case is a plain-English retelling of publicly available engineering material.

Read the full case on TechLogStack →

(Interactive diagrams, source links, and the full reader experience)

TechLogStack — built at scale, broken in public, rebuilt by engineers.

DEV Community