Matthias Bruns

Posted on May 28 • Originally published at appetizers.io

Kubernetes 1.36 Workload-Aware Scheduling: Gang Scheduling and Resource Optimization for AI/ML Workloads

#kubernetes #scheduling #aiworkloads #resourceoptimization

Kubernetes 1.36 extends the workload-aware scheduling work that began in 1.35, changing how AI/ML and batch processing workloads are scheduled in production clusters. The new architecture separates static templates from runtime state management, which supports gang scheduling and coordinated resource allocation as native Kubernetes features.

If you're running distributed training jobs, batch processing pipelines, or any workload that requires multiple pods to start together, these changes target the resource waste and scheduling inefficiencies you've been working around with custom schedulers.

Understanding the Architecture Evolution

Kubernetes 1.36 builds on the foundation laid in 1.35 with a clean architectural separation. According to the official Kubernetes blog, "the Workload API acts as a static template, while the new PodGroup API handles the runtime state."

This separation matters because it allows controllers, status reporting, and future workload-aware features to reason about related pods even when those pods don't require strict all-or-nothing admission. As Ryota Sawada explains, "The workload aware scheduling breaks that template part into workload and the actual runtime object into PodGroup, and that clear separation gives us even further clear connection point for the DRA."

The practical impact is that you can now define workload templates once and reuse them across multiple runtime instances, each with their own PodGroup managing the actual pod lifecycle and coordination.
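The template/runtime split can be pictured with a small model. This is only a sketch: the class names `WorkloadTemplate` and `PodGroupRuntime` and their fields are hypothetical stand-ins chosen for illustration, not the real API types.

```python
from dataclasses import dataclass, field


@dataclass(frozen=True)
class WorkloadTemplate:
    """Static description shared by every run (hypothetical model, not the real API type)."""
    name: str
    image: str
    gpus_per_pod: int


@dataclass
class PodGroupRuntime:
    """Per-run state that points back at a template (hypothetical model)."""
    name: str
    template: WorkloadTemplate
    min_count: int
    scheduled_pods: list[str] = field(default_factory=list)

    @property
    def admitted(self) -> bool:
        # The group counts as admitted once the quorum of pods is scheduled.
        return len(self.scheduled_pods) >= self.min_count


# One template, two independent runtime instances.
template = WorkloadTemplate("distributed-training-template", "tensorflow/tensorflow:latest-gpu", 1)
run_a = PodGroupRuntime("training-job-001", template, min_count=8)
run_b = PodGroupRuntime("training-job-002", template, min_count=8)
run_a.scheduled_pods.extend(f"worker-{i}" for i in range(8))
```

The template is frozen because it never changes between runs; all mutable state lives on the runtime object.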

Gang Scheduling: All-or-Nothing Pod Admission

Gang scheduling solves the fundamental problem of partial workload admission. In traditional Kubernetes scheduling, pods from a distributed training job might be scheduled individually, leading to scenarios where some pods start while others remain pending due to resource constraints. This creates resource waste and training delays.

The new gang scheduling implementation uses the all-or-nothing policy through the minCount field. As documented in the Medium article by Heba Elayoty, "The minCount field defines the quorum: at least that many pods must be schedulable together for the group to be admitted."

This means a distributed training job with 8 worker pods and minCount set to 8 will only start when all 8 pods can be scheduled together, preventing partial deployments that consume resources without producing useful work.
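The quorum rule can be sketched as a toy admission check. This is not the scheduler's actual algorithm: the function name `try_admit_gang`, the first-fit placement, and the GPU-only capacity model are all simplifying assumptions.

```python
def try_admit_gang(pod_gpu_requests, free_gpus_per_node, min_count):
    """Return a pod -> node placement if at least min_count pods fit, else None.

    Placement is tried on a copy of the free capacity, so a failed attempt
    reserves nothing. That is the all-or-nothing property.
    """
    free = dict(free_gpus_per_node)  # work on a copy, never the live state
    placement = {}
    for pod, need in pod_gpu_requests.items():
        for node, avail in free.items():
            if avail >= need:  # first node with enough free GPUs wins
                free[node] = avail - need
                placement[pod] = node
                break
    return placement if len(placement) >= min_count else None


pods = {f"worker-{i}": 1 for i in range(8)}
print(try_admit_gang(pods, {"node-a": 4, "node-b": 3}, min_count=8))  # only 7 GPUs free
print(try_admit_gang(pods, {"node-a": 4, "node-b": 4}, min_count=8))  # all 8 pods fit
```

With 7 free GPUs the call returns `None` and no pod is placed; with 8 it returns a complete placement.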

Workload-Aware Preemption

Kubernetes 1.36 introduces workload-aware preemption through KEP-5710, which treats groups of related pods as single entities for both scheduling and preemption decisions. According to Palark's analysis, "groups of related Pods (PodGroups) are now treated as a single entity for both scheduling and preemption. Rather than removing Pods one by one, the scheduler will figure out" how to handle entire workload groups.

This prevents the scenario where a high-priority workload preempts only some pods from a lower-priority distributed job, leaving the remaining pods running but unable to make progress. Instead, the scheduler considers the entire workload group when making preemption decisions.
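Group-level preemption can be illustrated with a toy victim selection. This sketch assumes a hypothetical `pick_victim_groups` helper that evicts whole groups, lowest priority first; the real scheduler weighs many more factors.

```python
def pick_victim_groups(running_groups, incoming_priority, gpus_needed):
    """Choose entire lower-priority groups to evict until enough GPUs are freed.

    running_groups: list of (name, priority, gpus_held) tuples.
    Returns the group names to evict, or [] if eviction cannot free enough.
    """
    candidates = sorted(
        (g for g in running_groups if g[1] < incoming_priority),
        key=lambda g: g[1],  # lowest priority first
    )
    victims, freed = [], 0
    for name, _priority, gpus in candidates:
        if freed >= gpus_needed:
            break
        victims.append(name)  # the whole group goes, never a subset of its pods
        freed += gpus
    return victims if freed >= gpus_needed else []


groups = [("etl", 10, 4), ("research", 50, 8), ("prod-train", 100, 8)]
print(pick_victim_groups(groups, incoming_priority=80, gpus_needed=10))
```

Here the incoming workload needs 10 GPUs, so both `etl` and `research` are evicted in full rather than leaving one of them half-running.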

Implementing Gang Scheduling for AI/ML Workloads

To implement gang scheduling for your AI workloads, you'll work with the Workload and PodGroup APIs. The Workload API defines the static template for your distributed job, while the PodGroup manages the runtime coordination.

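Here's a sketch of how a distributed training workload that requires all pods to start together might be structured. Alpha APIs change between releases, so confirm the API group, kinds, and field names against the API reference for your cluster version: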

apiVersion: workload.k8s.io/v1alpha1
kind: Workload
metadata:
  name: distributed-training-template
  namespace: ml-workloads
spec:
  podTemplate:
    spec:
      containers:
      - name: trainer
        image: tensorflow/tensorflow:latest-gpu
        resources:
          requests:
            nvidia.com/gpu: 1
            memory: 8Gi
            cpu: 4
          limits:
            nvidia.com/gpu: 1
            memory: 8Gi
            cpu: 4
        env:
        - name: WORLD_SIZE
          value: "8"
        - name: RANK
          valueFrom:
            fieldRef:
              fieldPath: metadata.annotations['workload.k8s.io/pod-index']
---
apiVersion: workload.k8s.io/v1alpha1
kind: PodGroup
metadata:
  name: training-job-001
  namespace: ml-workloads
spec:
  workloadRef:
    name: distributed-training-template
  minCount: 8
  replicas: 8
  schedulingPolicy: AllOrNothing

Setting minCount: 8 to match replicas: 8 means all 8 training pods must be schedulable before any of them start. This prevents resource waste from partial deployments and gives your distributed training job its full complement of workers before it begins.
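A small pre-submission check can catch a quorum that can never be met. This sketch assumes the `minCount` and `replicas` field names used in the manifest above; `validate_pod_group` is a hypothetical helper, not part of any Kubernetes client.

```python
def validate_pod_group(spec):
    """Return a list of problems found in a PodGroup-like spec dict."""
    problems = []
    min_count = spec.get("minCount")
    replicas = spec.get("replicas")
    if not isinstance(min_count, int) or min_count < 1:
        problems.append("minCount must be a positive integer")
    elif isinstance(replicas, int) and min_count > replicas:
        # More pods required than will ever be created: the group would wait forever.
        problems.append("minCount exceeds replicas, so the quorum can never be met")
    return problems


print(validate_pod_group({"minCount": 8, "replicas": 8}))
print(validate_pod_group({"minCount": 9, "replicas": 8}))
```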

Resource Optimization Strategies

Gang scheduling enables several resource optimization strategies that weren't possible with individual pod scheduling:

Coordinated Resource Allocation: Since all pods in a workload group are scheduled together, you can optimize resource requests knowing the entire workload's requirements. This prevents over-provisioning individual pods to account for uncertainty about whether the full workload will be admitted.

Improved Cluster Utilization: Gang scheduling reduces resource fragmentation by ensuring workloads only consume resources when they can run effectively. This is particularly important for GPU clusters where partial workloads tie up expensive resources without producing results.

Predictable Scheduling Behavior: With all-or-nothing admission, you can predict when workloads will start based on available cluster capacity, making it easier to plan batch processing windows and manage SLAs.
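The utilization argument is easiest to see in a toy simulation. The sketch below assumes two 8-pod jobs competing for a hypothetical 10-GPU cluster, one GPU per pod; both scheduling functions are deliberately simplified models, not scheduler code.

```python
def schedule_pod_by_pod(jobs, total_gpus):
    """Interleave pods from each job, one GPU per pod, until capacity runs out."""
    placed = {name: 0 for name in jobs}
    free = total_gpus
    progress = True
    while free > 0 and progress:
        progress = False
        for name, size in jobs.items():
            if free > 0 and placed[name] < size:
                placed[name] += 1
                free -= 1
                progress = True
    return placed


def schedule_as_gangs(jobs, total_gpus):
    """Admit a job only if all of its pods fit at once."""
    placed = {name: 0 for name in jobs}
    free = total_gpus
    for name, size in jobs.items():
        if size <= free:
            placed[name] = size
            free -= size
    return placed


def runnable_jobs(jobs, placed):
    """A job can make progress only when every one of its pods is placed."""
    return [name for name, size in jobs.items() if placed[name] == size]


jobs = {"job-a": 8, "job-b": 8}
per_pod = schedule_pod_by_pod(jobs, total_gpus=10)
gang = schedule_as_gangs(jobs, total_gpus=10)
print(per_pod, runnable_jobs(jobs, per_pod))
print(gang, runnable_jobs(jobs, gang))
```

Pod-by-pod placement ends with each job holding 5 GPUs and neither able to run, while gang admission runs `job-a` in full and leaves 2 GPUs free for other work.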

Integration with Dynamic Resource Allocation

The workload-aware scheduling improvements in Kubernetes 1.36 integrate closely with Dynamic Resource Allocation (DRA) for GPU scheduling. This integration provides native support for coordinated GPU allocation across pod groups, eliminating the need for custom schedulers or external resource managers.

The clear separation between Workload and PodGroup APIs creates what Ryota Sawada calls "an even further clear connection point for the DRA," enabling sophisticated resource allocation policies that consider the entire workload's GPU requirements when making scheduling decisions.

Production Implementation Guidelines

When implementing workload-aware scheduling in production, consider these key practices:

Start with Non-Critical Workloads: Begin by implementing gang scheduling for development and testing workloads before moving to production training jobs. This allows you to validate the behavior and tune resource requirements without impacting critical workflows.

Monitor Resource Utilization: Track how gang scheduling affects overall cluster utilization. While it may temporarily reduce utilization as workloads wait for full resource availability, it should improve effective utilization by reducing wasted partial deployments.

Set Appropriate Timeouts: Configure reasonable timeouts for workload admission to prevent jobs from waiting indefinitely for resources. This is particularly important in shared clusters where resource availability fluctuates.

Plan for Preemption Scenarios: Design your workload priorities and resource requests with workload-aware preemption in mind. Higher-priority workloads will preempt entire lower-priority workload groups, not individual pods.
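The timeout guideline can be expressed as a small policy function. The article does not name an API field for admission timeouts, so this sketch assumes the check lives in your own controller or tooling; `admission_decision` and its return values are hypothetical.

```python
from datetime import datetime, timedelta, timezone


def admission_decision(submitted_at, now, scheduled_pods, min_count, timeout):
    """Classify a pod group as 'admitted', 'waiting', or 'timed_out'."""
    if scheduled_pods >= min_count:
        return "admitted"
    if now - submitted_at > timeout:
        return "timed_out"  # the caller might requeue, alert, or cancel the job
    return "waiting"


submitted = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
later = submitted + timedelta(minutes=45)
print(admission_decision(submitted, later, scheduled_pods=0, min_count=8,
                         timeout=timedelta(minutes=30)))
```

Passing `now` in explicitly keeps the function deterministic and easy to test.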

Cost Optimization Benefits

Gang scheduling can deliver cost benefits for AI/ML workloads:

Reduced GPU Waste: By preventing partial training jobs from consuming GPU resources without making progress, gang scheduling can significantly improve GPU utilization rates. This is critical given GPU costs in cloud environments.

Lower Networking Costs: Distributed training jobs that start all pods together spend less time in initialization and synchronization phases, which may reduce cross-zone networking costs for multi-zone deployments.

Improved Throughput: Coordinated scheduling reduces the time between job submission and completion, allowing you to process more workloads with the same infrastructure investment.

Monitoring and Observability

Effective monitoring of workload-aware scheduling requires tracking metrics at both the individual pod and workload group levels. Key metrics include:

Workload admission latency (time from submission to all pods scheduled)
Resource utilization efficiency (productive vs. idle resource time)
Preemption frequency and impact on workload groups
Queue depth for pending workloads waiting for gang admission

The PodGroup API provides status information about workload coordination that wasn't available with individual pod monitoring, enabling better visibility into distributed workload behavior.
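Two of the metrics above can be computed from timestamps you already collect. This is a sketch under stated assumptions: the function names are hypothetical, and the timestamps are assumed to come from your own monitoring pipeline rather than from a specific PodGroup status field.

```python
from datetime import datetime, timedelta, timezone


def admission_latency_seconds(submitted_at, pod_scheduled_at, min_count):
    """Seconds from submission until the min_count-th pod was scheduled.

    Returns None if the quorum was never reached. When min_count equals the
    replica count, this is the time until all pods were scheduled.
    """
    if len(pod_scheduled_at) < min_count:
        return None
    quorum_time = sorted(pod_scheduled_at)[min_count - 1]
    return (quorum_time - submitted_at).total_seconds()


def utilization_efficiency(productive_gpu_seconds, idle_gpu_seconds):
    """Fraction of held GPU time that did useful work."""
    total = productive_gpu_seconds + idle_gpu_seconds
    return productive_gpu_seconds / total if total else 0.0


t0 = datetime(2025, 1, 1, 12, 0, tzinfo=timezone.utc)
scheduled = [t0 + timedelta(seconds=s) for s in (30, 45, 40)]
print(admission_latency_seconds(t0, scheduled, min_count=3))
print(utilization_efficiency(900, 100))
```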

Future Roadmap and Considerations

Kubernetes 1.36 represents the second major iteration of workload-aware scheduling, building on the foundation introduced in 1.35. The Kubernetes blog notes that "The recent 1.35 release of Kubernetes delivered the first tranche of workload aware scheduling improvements," indicating this is an evolving area with more enhancements planned.

Future developments will likely focus on more sophisticated scheduling policies, better integration with cluster autoscaling, and enhanced support for heterogeneous workloads that mix different resource types and scheduling requirements.

The clean API separation introduced in 1.36 provides a solid foundation for these future enhancements while maintaining backward compatibility with existing workloads.

Getting Started

To begin using workload-aware scheduling in Kubernetes 1.36, make sure the relevant feature gates are enabled on your cluster and start with simple gang scheduling use cases. Native support can reduce the need for the third-party schedulers and custom controllers that many teams have been using as workarounds.
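One way to check gate configuration is to parse the value passed to a component's `--feature-gates` flag. In this sketch the gate names `GenericWorkload` and `GangScheduling` are assumptions to verify against the release notes for your version, and a gate missing from the flag is treated as disabled, which ignores defaults.

```python
def parse_feature_gates(flag_value):
    """Parse 'A=true,B=false' (the --feature-gates value format) into a dict."""
    gates = {}
    for item in flag_value.split(","):
        item = item.strip()
        if not item:
            continue
        name, _, value = item.partition("=")
        gates[name.strip()] = value.strip().lower() == "true"
    return gates


def missing_gates(flag_value, required):
    """Return the required gates that are not explicitly enabled."""
    enabled = parse_feature_gates(flag_value)
    return [gate for gate in required if not enabled.get(gate, False)]


# Gate names below are assumptions for illustration; confirm them for your release.
required = ["GenericWorkload", "GangScheduling"]
print(missing_gates("GenericWorkload=true,GangScheduling=false", required))
```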

Focus on workloads where coordination provides clear benefits: distributed training, batch processing pipelines, and any application where partial deployment creates resource waste or operational complexity. The investment in migrating to workload-aware scheduling pays dividends through improved resource efficiency and more predictable application behavior.

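The architectural improvements in Kubernetes 1.36 move workload-aware scheduling closer to a production-ready solution for coordinated workloads, bringing native support for patterns that AI/ML and batch processing teams have needed for years. Because the APIs shown here are versioned v1alpha1 and sit behind feature gates, validate them on non-critical clusters first.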
