daniel jeong

Posted on • Originally published at manoit.co.kr

Microsoft KubeCon 2026 Key Announcements — DRA GA, AI Runway, and Kubernetes' Declaration as AI Infrastructure OS

Microsoft delivered a clear message at KubeCon 2026 in Amsterdam this March: Kubernetes is no longer just a control plane for cloud-native apps, but the operational foundation of modern AI infrastructure. From the DRA GA graduation to the AI Runway announcement, this post analyzes Kubernetes' entry into the AI era in detail.

1. Kubernetes Becomes the Operational Foundation of AI Infrastructure

At KubeCon + CloudNativeCon Europe 2026 in Amsterdam in March, Microsoft delivered a powerful message: Kubernetes is no longer just a control plane for cloud-native applications, but is becoming the operational foundation of modern AI infrastructure.

The evidence supporting this is clear: approximately 66% of generative AI workloads already operate on Kubernetes, and GPU scheduling standardization, inference serving frameworks, and AI observability have all been integrated into the K8s ecosystem.

2. DRA (Dynamic Resource Allocation) GA Graduation — GPU Scheduling Standardization

One of the most important announcements at this KubeCon was the GA (General Availability) graduation of DRA (Dynamic Resource Allocation). DRA replaces vendor-specific GPU scheduling with Kubernetes-native declarative approaches.

2.1 Problems with Previous Approach

Previously, GPU allocation relied on static extended resources such as nvidia.com/gpu. This approach has several problems:

  • Vendor Lock-In: GPU scheduling depends on vendor-specific device plugins
  • No GPU Sharing: a pod is allocated an entire GPU or nothing
  • No Topology Awareness: physical proximity between a GPU and its NIC cannot be considered
  • No Multi-Vendor Support: A100, H100, AMD MI-series, and Intel accelerators cannot be scheduled through a unified interface

2.2 DRA Innovation

DRA allows dynamically requesting and sharing specialized hardware like GPUs, FPGAs, and network accelerators, enabling optimal scheduling through topology-aware placement that considers physical proximity between GPUs and NICs.

# Previous static resource allocation (nvidia.com/gpu)
apiVersion: v1
kind: Pod
metadata:
  name: ml-training
spec:
  containers:
  - name: training
    image: ml-training:latest
    resources:
      limits:
        nvidia.com/gpu: 1  # ❌ 1 entire GPU, no sharing

---

# DRA-based dynamic resource allocation (Kubernetes 1.33+)
apiVersion: resource.k8s.io/v1beta1
kind: ResourceClaim
metadata:
  name: gpu-claim
spec:
  devices:
    requests:
    - name: gpu
      deviceClassName: gpu-compute  # ✓ Vendor-neutral DeviceClass abstraction
      selectors:
      - cel:
          expression: device.attributes["gpu.nvidia.com"].productName == "A100"

---

apiVersion: v1
kind: Pod
metadata:
  name: ml-training
spec:
  resourceClaims:
  - name: gpu-claim
    resourceClaimName: gpu-claim  # ✓ Other pods may reference the same claim to share the device
  containers:
  - name: training
    image: ml-training:latest
    resources:
      claims:
      - name: gpu-claim

2.3 DRA and Kubernetes 1.36 Synergy

In Kubernetes 1.36, Workload Aware Scheduling integrates with DRA's Workload API and strengthens KubeRay integration, allowing developers to more easily request and manage high-performance infrastructure for training and inference.

| Scenario | Previous (Static) | DRA (Dynamic) | Benefit |
|---|---|---|---|
| Sharing 1x A100 GPU (80 GB) | Not possible (1 pod only) | 4 pods (20% each) | 4x GPU utilization |
| GPU-NIC topology placement | Not considered (manual NUMA tuning) | Automatically optimized | 40% lower latency |
| Multi-GPU type support | NVIDIA only (AMD handled separately) | Unified interface | Multi-cloud standardization |
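The vendor-neutral piece that makes this table possible is the DeviceClass: the platform team publishes it once, and workloads then request a class like "gpu-compute" without naming a driver. A minimal sketch using the resource.k8s.io API (the class name and the attribute key in the CEL expression are illustrative, not standardized values):

```yaml
# DeviceClass published by the platform team; ResourceClaims reference it
# by name, keeping workload manifests vendor-neutral.
apiVersion: resource.k8s.io/v1beta1
kind: DeviceClass
metadata:
  name: gpu-compute
spec:
  selectors:
  - cel:
      # Illustrative attribute key: match devices advertised as GPUs
      expression: device.attributes["resource.kubernetes.io/type"] == "gpu"
```

Because the selector lives in the class rather than in each pod, swapping NVIDIA for AMD or Intel hardware only requires updating the DeviceClass, not every workload manifest.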

3. AI Runway — Deploy Models Without Knowing Kubernetes

AI Runway is a newly open-sourced project by Microsoft that provides a common API for Kubernetes-based inference workloads. It offers platform teams a standardized interface to manage model deployments centrally and flexibly respond as serving technologies evolve.

3.1 Four Core Features of AI Runway

  1. Web Interface-Based Deployment: ML engineers and data scientists unfamiliar with Kubernetes can deploy models. No need to write YAML directly; just follow the guided workflow
  2. HuggingFace Model Catalog Integration: Easy model search and selection. Popular open-source models deployable with one click
  3. GPU Memory Suitability Analysis: Immediate resource planning with each model's memory requirements and real-time cost estimation. e.g., "Llama 2 70B requires 2x A100, costs $3,000/month"
  4. Multiple Inference Runtime Support: Supports various inference runtimes (NVIDIA Dynamo, KubeRay, llm-d, KAITO) without vendor lock-in
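Microsoft has not published a resource schema for these features yet, so the manifest below is a purely hypothetical sketch of what a declarative AI Runway deployment request could look like; the API group, kind, and every field name are assumptions for illustration, not the project's actual API:

```yaml
# HYPOTHETICAL sketch only — AI Runway's real resource schema may differ.
apiVersion: airunway.example.io/v1alpha1   # made-up API group for illustration
kind: ModelDeployment
metadata:
  name: llama-2-70b
spec:
  model:
    source: huggingface                    # feature 2: catalog integration
    id: meta-llama/Llama-2-70b-hf
  runtime: vllm                            # feature 4: pluggable runtimes
  resources:
    acceleratorType: nvidia-a100
    count: 2   # feature 3: 70B params in fp16 ≈ 140 GB weights → 2x 80 GB A100
```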

3.2 Deployment Flow Comparison

# ❌ Previous approach (ML Engineer's tasks)
1. Write Deployment YAML
2. Configure Service
3. Set up Ingress
4. Allocate GPU resources
5. Configure monitoring
→ 3-5 hours, Kubernetes expert needed

# ✅ AI Runway (ML Engineer's experience)
1. Click "Deploy LLM"
2. Select model from HuggingFace (Llama 2 70B)
3. Verify GPU memory (auto-calculated: 2x A100 needed)
4. Click Deploy
→ 5 minutes complete, no Kubernetes knowledge needed

# Backend (Platform Team configures once)
AI Runway → KServe → vLLM → NVIDIA Dynamo → K8s Deployment

AI Runway's innovation significantly lowers the barrier to inference workload deployment. Previously, serving models required directly configuring Kubernetes Deployment, Service, and Ingress. AI Runway abstracts this process, letting teams focus on model serving.
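As a concrete example of the kind of manifest the platform-team backend can generate on the user's behalf, here is a KServe InferenceService serving a HuggingFace model on two GPUs (the storage URI and runtime choice are assumptions; verify the supported model formats against your KServe version):

```yaml
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llama-2-70b
spec:
  predictor:
    model:
      modelFormat:
        name: huggingface      # served by KServe's HuggingFace runtime
      storageUri: hf://meta-llama/Llama-2-70b-hf
      resources:
        limits:
          nvidia.com/gpu: "2"  # matches the 2x A100 estimate from the flow above
```

This is exactly the YAML that AI Runway spares the ML engineer from writing by hand.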

4. Cilium Enhancement — Sidecar-Free mTLS and eBPF Security

Microsoft greatly expanded its contributions to the Cilium project. The key announcement is native mTLS ztunnel support. This implements encrypted pod-to-pod communication without sidecar proxies, using X.509 certificates and SPIRE-based management.

4.1 Importance of Sidecar-Free mTLS

Traditional service meshes like Istio/Linkerd inject sidecar proxies into each pod, and the CPU and memory those sidecars consume add up quickly:

| Item | Istio/Linkerd (Sidecar) | Cilium mTLS (eBPF) | Savings |
|---|---|---|---|
| CPU overhead (per pod) | 50~100 millicores | ~2 millicores | 96~98% |
| Memory overhead (per pod) | 128~256 MB | ~5 MB | 96~98% |
| 100-pod cluster total overhead | 5~10 CPU cores, 12.8~25.6 GB | 0.2 CPU cores, 0.5 GB | Dramatic savings |

These savings are especially critical in AI clusters, because resources consumed by sidecars on GPU nodes translate directly into cost. If an A100 node costs $15,000/month, the extra GPU capacity needed to absorb sidecar overhead can waste $3,000~$5,000 monthly.
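Turning this on is a policy-level change rather than a per-pod one. A minimal sketch using Cilium's mutual authentication API (assumes mutual auth and SPIRE are already enabled cluster-wide via Helm; check the flag names against your Cilium version's docs):

```yaml
# Require mutually authenticated traffic into the inference service.
# Certificates are issued and rotated via SPIRE; enforcement happens in
# the eBPF datapath, so no sidecar is injected into any pod.
apiVersion: cilium.io/v2
kind: CiliumNetworkPolicy
metadata:
  name: inference-mtls
spec:
  endpointSelector:
    matchLabels:
      app: inference
  ingress:
  - fromEndpoints:
    - matchLabels:
        app: gateway
    authentication:
      mode: "required"   # reject peers without a valid identity certificate
```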

5. New CNCF Projects — HolmesGPT and Dalec

Two projects contributed by Microsoft joined the CNCF sandbox.

5.1 HolmesGPT — AI-Based Automatic Troubleshooting

HolmesGPT is an AI-based agentic troubleshooting tool. It combines telemetry data, inference engines, and operational runbooks to automatically diagnose complex cloud-native system issues. While traditional observability tools show "what happened," HolmesGPT infers "why it happened" and suggests solutions.

5.2 Dalec — Supply Chain Security Enhancement

Dalec is a project that defines declarative specifications for system package builds. It produces minimal container images while automatically generating an SBOM (Software Bill of Materials) and provenance attestation. With supply chain security critical in 2026, baking these guarantees into the build step itself is timely.
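A Dalec build is driven by a single declarative spec file. The abbreviated sketch below shows the general shape; the repository URL and build command are placeholders, and field names should be verified against the Dalec documentation:

```yaml
# Abbreviated Dalec-style spec (illustrative; verify fields against Dalec docs)
name: my-service
version: 1.0.0
revision: "1"
license: MIT
description: Minimal service image with SBOM and provenance attached at build time
sources:
  src:
    git:
      url: https://github.com/example/my-service.git  # placeholder URL
      commit: main
build:
  steps:
  - command: |
      cd src && make build
artifacts:
  binaries:
    src/bin/my-service: {}
```

Because the whole build is described declaratively, the SBOM and provenance attestation can be derived mechanically from the spec instead of being bolted on afterward.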

6. AKS Platform Updates — Operational Stability and Observability Enhancement

6.1 Operational Improvements

  • Blue-Green Agent Pool Upgrades: Parallel validation reduces deployment risk
  • Agent Pool Rollback Capability: Quickly revert versions and images
  • Prepared Image Specification: Improved node provisioning speed and consistency

6.2 Observability Improvements

  • GPU Performance Metrics: GPU metrics integrated into managed Prometheus and Grafana
  • L7-Level Network Visibility: Per-flow analysis at HTTP, gRPC, Kafka levels
  • Dynamic Metrics Collection: Container-level metrics via Kubernetes custom resources

6.3 Multi-Cluster and Storage

  • Azure Kubernetes Fleet Manager: Managed Cilium Cluster Mesh, unified service registry
  • Elastic SAN Pool Sharing: Reduces disk management burden per workload
  • AKS Desktop GA: Local development environment configuration matches production

7. Practical Implications

💡 Organizational Checklist:

GPU Workload Operations: Focus on DRA GA. Transitioning from static nvidia.com/gpu allocation to dynamic DeviceClass/ResourceClaim-based allocation can improve GPU utilization by 20~30% through topology-aware scheduling and resource sharing.

ML Team and Platform Team Gap: Review AI Runway. ML engineers can deploy models without YAML, enabling fast self-service inference platform construction.

Service Mesh Overhead: Time to consider Cilium's sidecar-less mTLS. Sidecar elimination is especially dramatic for AI workloads where GPU node resources are precious.

8. Conclusion: Kubernetes Enters the AI Era

The Microsoft KubeCon 2026 announcements present a clear direction: Kubernetes is evolving into a single platform that unifies AI workload scheduling, serving, networking, security, observability, and lifecycle management.

DRA GA standardizes GPU scheduling, AI Runway abstracts model serving, and Cilium mTLS eliminates infrastructure overhead. Together, these enable organizations to operate AI infrastructure efficiently at scale.

Kubernetes is no longer a "container orchestration tool." It is the operational foundation of modern AI infrastructure.

This article was created with AI technology support. For more cloud-native engineering insights, visit the ManoIT Tech Blog.
