
Hao Jiang

Posted on • Originally published at Medium

Solving the Noisy Neighbor Problem: A Multi-Year Journey to IO Isolation on Kubernetes

TL;DR: Vanilla Kubernetes is IO-unaware, causing noisy neighbors to hang Docker daemons via PLEG timeouts. We upgraded thousands of nodes to K8s v1.22, enabled cgroup v2, and partnered with Intel to build a custom scheduler plugin and node agent that throttles disk (io.max) and network (Linux TC) bursts. Result: Validated technical readiness for safe stateful K8s migrations.

Section 1: Introduction

Why I'm Writing This

This is the story of a multi-year effort to solve a large-scale data company's noisy neighbor problem on Kubernetes—a fundamental limitation that blocked the migration of critical stateful workloads to our platform.

By the end of my tenure, we had validated a solution through a partnership with Intel. The test cluster proved the approach worked, four cross-functional stakeholders approved the design, and a clear rollback strategy ensured operational safety.

I'm writing this blog to preserve the knowledge, recognize the collaborative effort, and help others facing similar challenges. Multi-year infrastructure transformations are hard. The biggest challenges aren't always technical—they're about identifying the right problem, convincing people the solution is necessary, and getting teams to work together. This writeup documents what we learned so the effort doesn't get lost.

The Context: A Platform Under Pressure

The company's Kubernetes platform isn't just running stateless web applications—it's the engine for massive, stateful data systems. At this scale, vanilla Kubernetes starts to break down. We maintained our own internal fork with some custom patches to survive our operational requirements.

Solving the noisy neighbor problem required a multi-year transformation:

  1. Phase 1: The Foundation - Upgrade Kubernetes from v1.18 to v1.22 to enable cgroup v2 support (required for IO throttling)
  2. Phase 2: The Solution - Integrate IOIsolation to provide IO-aware scheduling and preventive isolation

This is the story of that journey.


Section 2: The Problem - Kubernetes is IO-Unaware

The Fundamental Gap

Kubernetes scheduling has a fundamental limitation: it's IO-unaware. The scheduler considers CPU and memory when placing pods, but completely ignores IO capacity:

  • No visibility into node IO bandwidth (disk or network)
  • No way for pods to request IO resources
  • No mechanism to prevent multiple high-IO workloads from landing on the same node

This isn't a bug—it's a design gap. Kubernetes assumes IO resources are either infinite or managed externally. For many workloads, this assumption is fine. Given our large scale and workload mix, it was a critical problem.

The Symptoms

The IO-unaware scheduling led to recurring production incidents:

  1. Multiple IO-hungry pods would be scheduled on the same node (no IO capacity awareness)
  2. These pods would consume excessive disk or network bandwidth
  3. Kernel CPU would spike due to interrupt handling and context switching
  4. The Docker daemon would hang, unable to respond to requests
  5. Kubelet's Pod Lifecycle Event Generator (PLEG) would timeout trying to reconcile container state
  6. Nodes would be marked NotReady, triggering cascading alerts and pod evictions

A particularly insidious failure mode: If a high-IO workload caused a kernel-level IO hang, the Docker daemon would often block, triggering a PLEG timeout. This made the Kubelet think the node was dead, even if CPU was at 0%.

Multiple teams were filing incidents. Internal engineering jobs were disrupted. But the most critical impact was strategic: we couldn't migrate database-like applications to Kubernetes because the noisy neighbor risk was too high.

The Investigation

Digging into the affected nodes, we correlated three signals:

  • IRQ counts spiking during IO-heavy workload activity
  • Excessive context switching correlating with IO load
  • Docker daemon hang and recovery patterns

When the Docker daemon hung, restarting it didn't help. In fact, it often made things worse. When system CPU was high, the new Docker daemon could fail to start. Reactive recovery (restarting Docker) was unreliable.

Testing the Xen Hypothesis

I initially suspected AWS's Xen hypervisor might be the culprit. On older instance types (m4, c4, r4), Xen's dom0 handles disk I/O virtualization in software, which can cause severe CPU steal time during high I/O operations.

So I tested the same workloads on Nitro instances (c5, r5), which offload EBS I/O and network virtualization to dedicated hardware cards. There's no dom0 stealing CPU cycles.

The same failure occurred on Nitro. This ruled out the hypervisor: hardware offloading cannot fix OS-level kernel bottlenecks. Even with Nitro's hardware acceleration, the kernel still processes I/O interrupts, context switches, and completion handlers. When noisy neighbors saturated I/O, kernel CPU spiked, and the Docker daemon couldn't recover.

The IO-Blindness Trap

  • The Scheduler Issue: Standard K8s is "IO-Unaware." It treats a node with 10 pods and a node with 50 pods as equally "available" if the CPU/Mem metrics are the same, completely ignoring the IOPS/Throughput saturation on the underlying EBS/Disk.
  • The Stack Issue (Dockershim): High IO workloads don't just slow down neighbors; they saturate the Docker daemon. Because Dockershim/Docker is a centralized bottleneck, a single pod's IO burst can cause the daemon to hang, leading to a NodeNotReady state.
  • The cgroup v1 Limitation: You can't fix this with software tweaks because cgroup v1 cannot track or throttle buffered IO (writes that go to the page cache first).

Why Containerd Helps (But Isn't Enough)

Moving to containerd (standard in Kubernetes v1.22+) removes Dockershim and the heavy Docker daemon. Containerd is a "lean" runtime—significantly more resilient to being locked up by a single pod's behavior. This improves node stability.

But even with containerd, the underlying Linux kernel still has the same limitation under cgroups v1. Containerd makes the management layer (the runtime) more stable, but it does nothing to protect the data layer (the performance of neighboring pods).

Containerd reduces the risk of node-level failures, but doesn't solve the noisy neighbor problem.

The Solution Requirements

Solving this required two things:

  1. IO-aware scheduling: Place pods based on available IO capacity (scheduler plugin)
  2. Preventive isolation: Throttle buffered I/O before it saturates the kernel (cgroups v2)

Section 3: The Foundation - Preparing the Platform

IOIsolation had strict technical prerequisites that required significant preparation work:

1. Kubernetes v1.22 Minimum

IOIsolation required Kubernetes v1.22 or higher with systemd as the cgroup driver.

2. Custom AMI with cgroup v2 Enabled

We built a custom Amazon Machine Image (AMI) with the OS kernel configured to enable cgroup v2 by default. This required:

  • Setting kernel boot parameters to enable cgroup v2
  • Validating compatibility with existing workloads
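As a concrete illustration of the boot-parameter step (paths and tooling depend on your distro and bootloader; treat this as a sketch, not our exact AMI build script), a GRUB-based image enables the unified cgroup v2 hierarchy like this:

```shell
# /etc/default/grub — ask systemd to mount the unified cgroup v2 hierarchy
GRUB_CMDLINE_LINUX="... systemd.unified_cgroup_hierarchy=1"

# Rebuild the GRUB config baked into the AMI (RHEL-family path shown)
grub2-mkconfig -o /boot/grub2/grub.cfg

# After reboot, verify: the filesystem type at /sys/fs/cgroup should be cgroup2
stat -fc %T /sys/fs/cgroup
```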

3. Dual Filesystem Compatibility

Another requirement was ensuring the IOIsolation code worked with both cgroup v1 and v2 filesystems. Why?

  • We couldn't flip the entire fleet to cgroup v2 overnight
  • Different node pools might be at different migration stages
  • Rollback scenarios required v1 compatibility

Section 4: The Solution - IO Isolation Architecture

Part A: System Overview

IOIsolation provides IO-aware scheduling and enforcement through cgroup v2. The system consists of four main components working together to ensure pods get the IO bandwidth they need while preventing noisy neighbors from impacting others.

Figure 1: High-level design of the IOIsolation framework. The system integrates a custom Kubernetes Scheduler Plugin with a node-level Enforcement Agent to solve the "noisy neighbor" problem. It utilizes CRDs for state management, cgroup v2 for throttling, and a specialized Aggregator to maintain scalability across thousands of nodes.

The Four Components:

  1. Scheduler Plugin - Makes IO-aware pod placement decisions
  2. Node Agent - Monitors IO usage and enforces limits via cgroups
  3. Custom Resource Definitions (CRDs) - Configuration and state management
  4. Aggregator (optional) - Centralized monitoring and coordination

As shown in Figure 1, the Control Plane contains the Disk IO Scheduler Plugin and Resource IO Aggregator, while each Worker Node runs the Node IO Agent (which handles eBPF monitoring, cgroup enforcement, and NRI integration). The CRDs (NodeStaticIOInfo, NodeIOStatus) flow between these layers to maintain consistent state.

Let's examine each component and how they work together.


Component 1: Scheduler Plugin

The scheduler plugin extends Kubernetes' default scheduler with IO awareness. When a pod requests IO resources, the scheduler ensures it's placed on a node with sufficient available bandwidth.

How It Works:

1. Pod Creation

apiVersion: v1
kind: Pod
metadata:
  name: io-heavy-app   # illustrative name
  annotations:
    ioi.intel.com/disk-bandwidth: "100MB/s read, 50MB/s write"
    ioi.intel.com/network-bandwidth: "200Mbps"
spec:
  containers:
    - name: app
      image: example/app:latest   # illustrative image

2. Scheduler Plugin Reads Annotations

  • Parses IO requirements from pod annotations
  • Determines IO class (GA = Guaranteed Allocation, BE = Best Effort)

3. Filter Phase

  • Eliminates nodes without sufficient IO capacity
  • Checks NodeIOStatus CRD for each node's available bandwidth

4. Score Phase (LeastAllocated Strategy)

  • Ranks remaining nodes by available IO capacity
  • The LeastAllocated strategy acts as a natural dampener, spreading the pods out and preventing a "thundering herd" from piling onto a single node before the CRDs can update.

5. Bind Phase

  • Assigns pod to selected node
  • Updates node's reserved bandwidth
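The Filter phase above boils down to comparing a pod's requested bandwidth against each node's remaining capacity. A sketch as a plain function (the names and the simplified status type are mine, not the plugin's real API):

```go
package main

import "fmt"

// NodeIO is a simplified per-node view of available disk bandwidth, in MB/s.
type NodeIO struct {
	Name              string
	ReadMBs, WriteMBs int
}

// FilterNodes drops nodes that cannot satisfy a pod's requested bandwidth,
// mirroring the plugin's Filter phase.
func FilterNodes(nodes []NodeIO, reqRead, reqWrite int) []string {
	var fit []string
	for _, n := range nodes {
		if n.ReadMBs >= reqRead && n.WriteMBs >= reqWrite {
			fit = append(fit, n.Name)
		}
	}
	return fit
}

func main() {
	nodes := []NodeIO{
		{"node1", 500, 400},
		{"node2", 50, 40}, // already saturated by neighbors
	}
	// Pod requests 100 MB/s read, 50 MB/s write: only node1 survives Filter.
	fmt.Println(FilterNodes(nodes, 100, 50))
}
```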

The Scheduler's View of Node Capacity:

The scheduler maintains a local cache of node IO capacity by watching NodeIOStatus CRDs:

// Simplified example of the scheduler's cached view
type Bandwidth struct {
    In  int // read / ingress capacity available
    Out int // write / egress capacity available
}

type IOPools struct {
    GA Bandwidth // Guaranteed Allocation pool
    BE Bandwidth // Best Effort pool
}

type NodeIOStatus struct {
    NodeName      string
    DisksStatus   map[string]IOPools
    NetworkStatus IOPools
}

var status = NodeIOStatus{
    NodeName: "node1",
    DisksStatus: map[string]IOPools{
        "/dev/sda": {
            GA: Bandwidth{In: 500, Out: 400}, // 500 MB/s read, 400 MB/s write available
            BE: Bandwidth{In: 100, Out: 80},  // 100 MB/s read, 80 MB/s write available
        },
    },
    NetworkStatus: IOPools{
        GA: Bandwidth{In: 800, Out: 800}, // Mbps available
        BE: Bandwidth{In: 100, Out: 100}, // Mbps available
    },
}

The scheduler reads node capacity from CRDs via standard Kubernetes Informers (watch mechanism).


Component 2: Node Agent

The node agent is the "brain" of the system, running as a DaemonSet on each node. It works with ioi-service to monitor and enforce IO limits.

Note: In Figure 1, the "Node IO Agent" box represents the combination of two components working together: the Node Agent (running as a Kubernetes DaemonSet) and the ioi-service (running as a systemd service with root privileges). This separation allows the DaemonSet to handle Kubernetes integration while the privileged systemd service performs low-level cgroup operations.

The ioi-service is a privileged systemd service that handles the low-level IO monitoring and enforcement:

  • Monitors IO via eBPF or io.stat (cgroup v2)
  • Writes bandwidth limits to cgroup files
  • Communicates with node-agent via gRPC over Unix socket

The node agent has three main responsibilities:

1. Bandwidth Profiling (Disk Only)

When the node-agent starts, it measures actual disk bandwidth capacity using fio (Flexible IO Tester):

# Example: 4K random-read profile of a test file on /dev/sda's mount
fio --name=profile \
    --filename=/mnt/sda/test \
    --direct=1 \
    --rw=randread \
    --bs=4k \
    --size=20G \
    --runtime=60s \
    --output-format=json \
    --output=results.json

The agent tests multiple block sizes (512B, 1K, 4K, 8K, 16K, 32K) for both read and write operations. Results are stored and reused (profiling takes ~10 minutes per disk).

Network bandwidth is simpler: The agent reads the link speed from sysfs (/sys/class/net/eth0/speed) or uses a configured value.
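Reading the NIC speed really is a one-liner: the sysfs file holds the link speed in Mbps followed by a newline. A sketch (error handling trimmed; function name is mine):

```go
package main

import (
	"fmt"
	"os"
	"strconv"
	"strings"
)

// LinkSpeedMbps parses the contents of /sys/class/net/<dev>/speed,
// which is the link speed in Mbps followed by a newline.
func LinkSpeedMbps(raw string) (int, error) {
	return strconv.Atoi(strings.TrimSpace(raw))
}

func main() {
	raw, err := os.ReadFile("/sys/class/net/eth0/speed")
	if err != nil {
		fmt.Println("no sysfs speed available; fall back to a configured value")
		return
	}
	mbps, _ := LinkSpeedMbps(string(raw))
	fmt.Println(mbps, "Mbps")
}
```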

2. Pool Management

The agent divides bandwidth into pools based on admin configuration:

# Admin ConfigMap
diskpools: |
  GA=100   # Guaranteed Allocation: 100% of capacity
  BE=20    # Best Effort: 20% of capacity

networkpools: |
  GA=95    # Guaranteed Allocation: 95% of capacity (950 Mbps)
  BE=5     # Best Effort: 5% of capacity (50 Mbps)

Example calculation:

Disk Capacity (from profiling): 650 MB/s read, 600 MB/s write
↓
GA Pool:  650 MB/s × 100% = 650 MB/s read, 600 MB/s write
BE Pool:  650 MB/s × 20%  = 130 MB/s read, 120 MB/s write
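The pool split is simple integer arithmetic over the profiled capacity; as a sketch (function name is mine):

```go
package main

import "fmt"

// PoolMBs applies a pool percentage from the admin ConfigMap to the
// profiled disk capacity, yielding the pool's bandwidth in MB/s.
func PoolMBs(capacityMBs, percent int) int {
	return capacityMBs * percent / 100
}

func main() {
	readCap, writeCap := 650, 600 // MB/s, from fio profiling
	fmt.Printf("GA: %d MB/s read, %d MB/s write\n", PoolMBs(readCap, 100), PoolMBs(writeCap, 100))
	fmt.Printf("BE: %d MB/s read, %d MB/s write\n", PoolMBs(readCap, 20), PoolMBs(writeCap, 20))
}
```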

3. Dynamic Bandwidth Enforcement

The node-agent continuously monitors actual IO usage and adjusts limits dynamically. It enforces these limits at two distinct layers:

A. Block Layer Enforcement (cgroups)

Every 2-5 seconds:

  1. Receive bandwidth data from ioi-service (via gRPC)
  2. Calculate actual usage per pod, per QoS class
  3. Recalculate available bandwidth
  4. Update cgroup io.max limits via ioi-service
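Step 4 ultimately means writing a limit line into the pod cgroup's io.max file, whose cgroup v2 format is `MAJ:MIN rbps=<bytes/s> wbps=<bytes/s>`. A sketch of composing that line (device numbers and limits are examples, not IOIsolation's actual code):

```go
package main

import "fmt"

// IoMaxLine renders a cgroup v2 io.max entry for one block device.
// major:minor identify the device (e.g. 8:0 for /dev/sda); limits are bytes/s.
func IoMaxLine(major, minor int, readBps, writeBps uint64) string {
	return fmt.Sprintf("%d:%d rbps=%d wbps=%d", major, minor, readBps, writeBps)
}

func main() {
	// Cap a pod at 100 MB/s read, 50 MB/s write on /dev/sda.
	line := IoMaxLine(8, 0, 100*1024*1024, 50*1024*1024)
	fmt.Println(line)
	// ioi-service would write this line into the pod's cgroup, e.g.
	// /sys/fs/cgroup/kubepods.slice/<pod>/io.max
}
```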

B. Network Layer Enforcement (tc & ifb)

Because cgroups don't throttle networking, the node agent uses Linux Traffic Control (tc). It creates an Intermediate Functional Block (ifb) device for the pod. This allows us to reliably shape both ingress and egress traffic, ensuring BE pods don't starve GA pods of network bandwidth.
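The ifb redirect pattern looks roughly like this (a hand-written sketch of standard tc usage, not the agent's actual commands; device names and rates are examples, and the commands require root):

```shell
# Create an ifb device and redirect the interface's ingress traffic into it
ip link add ifb0 type ifb
ip link set ifb0 up
tc qdisc add dev eth0 handle ffff: ingress
tc filter add dev eth0 parent ffff: matchall \
   action mirred egress redirect dev ifb0

# Ingress can now be shaped like egress: rate-limit a BE pod to 50 Mbit/s
tc qdisc add dev ifb0 root tbf rate 50mbit burst 32k latency 400ms
```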

The system dynamically adjusts BE bandwidth based on actual GA/BE usage. For disk IO, BE can be squeezed to almost nothing. For network IO, each BE pod maintains a minimum bandwidth to prevent TCP connection timeouts.


Component 3: Custom Resource Definitions (CRDs)

The system uses three CRDs to manage configuration and state:

1. NodeStaticIOInfo - Static node capacity (from profiling)

apiVersion: ioi.intel.com/v1
kind: NodeStaticIOInfo
metadata:
  name: node1-staticinfo
spec:
  nodeName: node1
  disks:
    - id: "disk-sda"
      path: "/dev/sda"
      capacity:
        read: 650   # MB/s
        write: 600  # MB/s
  network:
    linkSpeed: 1000  # Mbps

2. NodeIOStatus - Dynamic node status (updated frequently)

apiVersion: ioi.intel.com/v1
kind: NodeIOStatus
metadata:
  name: node1-nodeioinfo
status:
  disksStatus:
    disk-sda:
      GA:
        In: 500   # 500 MB/s read available
        Out: 400  # 400 MB/s write available
      BE:
        In: 130   # 130 MB/s read available
        Out: 120  # 120 MB/s write available
  networkStatus:
    GA: { In: 800, Out: 800 }
    BE: { In: 50, Out: 50 }

3. IOIPolicy - IO classes and policies

apiVersion: ioi.intel.com/v1
kind: IOIPolicy
metadata:
  name: default-policy
spec:
  ioClasses:
    - name: "GA"  # Guaranteed Allocation
      priority: 1
    - name: "BE"  # Best Effort
      priority: 2

Component 4: Aggregator (Optional)

The aggregator sits between node-agents and the Kubernetes API server, collecting metrics and batching updates.

Data Flow With Aggregator:

Node-agent (node1) ──┐
Node-agent (node2) ──┤
Node-agent (node3) ──┼──> Aggregator ──> Batch Update ──> NodeIOStatus CRDs
       ...           │
Node-agent (nodeN) ──┘

What it does:

  1. Receives bandwidth metrics from all node-agents via gRPC
  2. Batches updates (every 5 seconds, configurable)
  3. Writes to NodeIOStatus CRDs in Kubernetes API server
  4. Reduces API server write load by ~90% at scale

Data Flow Without Aggregator (Direct Mode):

Node-agent (node1) ──> Write NodeIOStatus CRD ──> API Server
Node-agent (node2) ──> Write NodeIOStatus CRD ──> API Server
Node-agent (node3) ──> Write NodeIOStatus CRD ──> API Server
       ...

Each node-agent writes directly to its own NodeIOStatus CRD.


Complete Workflow

A complete workflow from pod creation to IO enforcement:

  1. Step 1: Pod Submission & Specification
  2. Step 2: Scheduler Plugin Filters Nodes
  3. Step 3: Node-Agent Registers Pod
  4. Step 4: IOI-Service Applies Limits
  5. Step 5: Continuous Monitoring
    • Every 2-5 seconds:
      1. ioi-service reads io.stat
      2. Calculates bandwidth: 180 MB/s read, 95 MB/s write (under limit)
      3. Sends to node-agent via gRPC
      4. Node-agent updates NodeIOStatus CRD (directly or via aggregator)
      5. Scheduler sees updated capacity for future scheduling decisions
  6. Step 6: Dynamic Adjustment

Part B: Design Trade-offs: Flexibility and Robustness

While integrating IOIsolation, we faced several architectural decisions. Rather than viewing these as binary choices, the ideal approach is to implement both options and let administrators choose based on their specific context. Here's how we approached these trade-offs:

1. Container Lifecycle Detection: OOB vs NRI

The Options:

  • NRI (Node Resource Interface): Tighter integration with containerd, official Kubernetes API
  • OOB (Out-of-Band): Watching Kubernetes API and pod events from outside the container runtime

Note: Figure 1 shows the NRI Client path for container lifecycle detection. OOB (via inotify) achieves the same goal—detecting when containers start/stop to apply IO limits—but through a different mechanism that doesn't require runtime integration.

Our Decision: OOB (via inotify)

Rationale:

  • Less invasive to the container runtime stack
  • Simpler rollback path (no runtime modifications)
  • Technically, there is still an asynchronous micro-race of a few tens of milliseconds before io.max is written. In practice, a container doing unthrottled I/O for ~50 milliseconds at startup will not cause a PLEG timeout or node lockup.

The Better Long-Term Solution: Implement both and make it configurable. Different environments have different constraints:

  • Use NRI when: You want tighter runtime integration, have experience with NRI, need lower latency detection
  • Use OOB when: You want simpler deployment, easier rollback, or don't want to modify the runtime stack

System Robustness Perspective: Having both options increases flexibility and reduces deployment risk. Administrators can choose based on their operational maturity and risk tolerance.

2. Bandwidth Monitoring: io.stat vs eBPF

The Options:

  • eBPF: Kernel-level tracing, more granular visibility
  • io.stat: Cgroup v2 native statistics, simple file reads

Our Decision: io.stat

Rationale:

  • Cgroup v2 native interface, simpler implementation
  • Sufficient granularity for our use case
  • Fewer moving parts, easier to debug

The Critical Issue with eBPF: Sometimes eBPF metrics are unavailable (kernel version incompatibility, eBPF program failures). If you rely solely on eBPF, you lose monitoring when it fails.

For Long-Term Solution, we can use both and a fallback strategy:

  1. Try to collect eBPF metrics
  2. If eBPF fails, fall back to io.stat
  3. Ensure metrics are always available, regardless of eBPF status
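The fallback path ultimately reads cgroup v2's io.stat, whose lines look like `253:0 rbytes=... wbytes=... rios=... wios=...`. A sketch of the parse step that both paths can share (function name is mine, not the project's API):

```go
package main

import (
	"fmt"
	"strconv"
	"strings"
)

// ParseIOStat extracts per-device counters from one cgroup v2 io.stat line,
// e.g. "253:0 rbytes=1048576 wbytes=524288 rios=256 wios=128".
func ParseIOStat(line string) (dev string, counters map[string]uint64) {
	fields := strings.Fields(line)
	if len(fields) == 0 {
		return "", nil
	}
	dev = fields[0]
	counters = make(map[string]uint64)
	for _, f := range fields[1:] {
		kv := strings.SplitN(f, "=", 2)
		if len(kv) != 2 {
			continue // skip malformed tokens
		}
		if n, err := strconv.ParseUint(kv[1], 10, 64); err == nil {
			counters[kv[0]] = n
		}
	}
	return dev, counters
}

func main() {
	dev, c := ParseIOStat("253:0 rbytes=1048576 wbytes=524288 rios=256 wios=128")
	fmt.Println(dev, c["rbytes"], c["wbytes"])
}
```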

3. Architecture: With Aggregator vs Without Aggregator

The Component: The aggregator collects bandwidth metrics from all node-agents via gRPC and writes batched updates to Kubernetes CRDs (NodeIOStatus). This reduces API server write load.

The Data Flow:

Without Aggregator (Direct Mode):

Node-agent (500 nodes) → Write NodeIOStatus CRD → API Server
Frequency: Every 2-5 seconds per node

With Aggregator (Batched Mode):

Node-agent (500 nodes) → gRPC → Aggregator → Batched CRD writes → API Server
Frequency: Batched every 5 seconds

The Trade-offs:

Pros:

  • Dramatically reduces API server write load at scale (90% reduction)
  • Batched updates are more efficient
  • Centralized monitoring view

Cons:

  • Single point of failure: If aggregator fails, monitoring stops
  • Added complexity: One more component to deploy, monitor, debug
  • Not necessary at small scale: For <100 nodes, the API server can handle direct writes

Our Approach: We planned for Aggregator High Availability: moving the Aggregator from a single instance to a replicated, highly-available deployment (via Kubernetes Leases) to eliminate the SPOF.

4. Network Pool Sizing: Conservative vs Aggressive

The Challenge: Unlike disk bandwidth (which we measured through profiling), network bandwidth capacity is uncertain:

  • AWS reports link speed: 1000 Mbps (from /sys/class/net/eth0/speed)
  • But actual available capacity is unknown and varies by network congestion, switch limitations, etc.

Our Approach: Reserve bandwidth based on the number of BE pods and their minimum requirements:

Scenario: 20 BE pods × 5 Mbps minimum = 100 Mbps reserved for BE
GA available: 1000 Mbps - 100 Mbps = 900 Mbps maximum

Rationale:

  • The TCP Measurement Trap: We intentionally decided not to measure live TCP bandwidth dynamically like we do for disk. Measuring live TCP throughput is computationally expensive and highly volatile. Instead, relying on a static, conservative ceiling for GA and a guaranteed minimum floor for BE pods protects the system with near-zero compute overhead.
  • The EBS Dual-Constraint: In AWS, EBS volumes are network-attached. This means EBS burst limits are constrained by two factors: the disk's io.max AND the instance's network bandwidth. By strictly throttling network traffic, we inadvertently created a secondary safeguard against EBS burst depletion.
  • Guarantees TCP viability: Each BE pod gets enough bandwidth
  • Simple and predictable: Easy to reason about and configure
  • Scheduler-aware: The scheduler knows exactly how much bandwidth is available for GA pods
  • Prevents starvation: GA cannot accidentally starve BE pods below their minimum

The Trade-off: This conservative approach leaves potential network capacity on the table to ensure BE workloads remain functional even under worst-case network conditions.


Section 5: Validation

Test Cluster Setup

We validated IOIsolation on a test cluster (15-20 nodes) to prove the technical approach before committing to production rollout.

Test Environment:

  • Cluster size: 15-20 nodes
  • Instance types: Mix of c5 and r5 instances (AWS Nitro)
  • Workloads: IO-heavy applications similar to production (simulated database workloads, data processing jobs)
  • Duration: Several weeks of testing

Technical Testing:

  • Scheduler correctly placed pods based on IO capacity
  • Node agents profiled disks and applied bandwidth limits
  • Aggregator collected metrics and updated CRDs
  • System ran for several weeks without major failures

Organizational Approval: Getting 4 cross-functional approvers (Security, Data Engineering, Compute Platform, Infrastructure) required addressing concerns about:

  • Rollback strategy if the system caused issues
  • Additional complexity in the platform

The design doc was approved, clearing the path for production rollout.


Section 6: Conclusion

We integrated IOIsolation into the company's Kubernetes platform:

  • Foundation: Kubernetes v1.22+, cgroup v2 enabled, dual filesystem compatibility
  • Components: Scheduler plugin, node agent, aggregator, CRDs
  • Validation: Test cluster (15-20 nodes) proved the technical approach works
  • Approval: Design approved by 4 cross-functional stakeholders

The system addresses the noisy neighbor problem by isolating IO resources between pods.

Before production rollout could begin, the company made a strategic shift to a managed cloud Kubernetes offering, and Intel's team was impacted by organizational changes. The foundation is in place. The knowledge is preserved here for teams facing similar challenges.


Section 7: Future Enhancements

While this architecture successfully mitigated our noisy neighbor kernel lockups, infrastructure evolution never stops. Future teams could explore:

  1. IOPS and Latency Awareness: Currently, the system isolates based on raw bandwidth (MB/s). For strict database performance, extending io.max enforcement to include IOPS (riops/wiops) is the next logical step.

  2. L3 Cache Isolation (Intel RDT/CAT): At high density, L3 cache contention causes P99 latency degradation even when CPU priority is correct. Intel CAT (Cache Allocation Technology) can partition L3 cache between GA and BE workloads - already supported by the IOIsolation framework.


Acknowledgments

This project was a collaboration across teams and companies:

Intel team: Designed and built the IOIsolation framework and partnered with us through integration and validation.

The company's Compute Platform team: Provided the Kubernetes foundation and operational expertise.

Cross-functional stakeholders: Asked hard questions about reliability and operational complexity that made the design stronger.
