Karpenter on AKS in 2026: What Actually Works
Karpenter on AKS has gone from "interesting experiment" to "something you can actually run in production." It comes with caveats, and reading them now will save you a weekend of pain. This post is a field report, not a sales pitch.
The Short Version
If you're running homogeneous, predictable workloads and you're happy with cluster-autoscaler (CAS), stay there. CAS is boring, it works, and Azure supports it fully. If you're running GPU workloads, spot-heavy batch pipelines, or you need bin-packing that doesn't require you to pre-define a node pool for every VM SKU you might want, Karpenter is now worth the operational overhead.
What Karpenter Actually Does on AKS
Karpenter watches for unschedulable pods and provisions nodes directly via the Azure provider; you don't need a VMSS node pool for every SKU combination. It provisions, consolidates, and terminates nodes based on pod requirements and your defined NodePool and AKSNodeClass resources.
The AKS provider for Karpenter (karpenter-provider-azure) is a separate project from the AWS provider. Same core Karpenter engine, different provider implementation. This matters because feature parity with AWS Karpenter is not guaranteed and the cadence of releases differs.
Prerequisites and Installation
You need:
- An AKS cluster created with --network-plugin azure, with or without --network-plugin-mode overlay (Azure CNI in either mode works; kubenet is not supported)
- A managed identity with the right RBAC: the provider needs to create and delete VMs and manage NICs, disks, and NSGs
- Workload identity enabled on the cluster
- Karpenter installed via Helm
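Before installing, the network plugin and workload identity items can be read straight off the cluster resource. The following is a minimal sketch, not an official preflight check: check_prereqs is a hypothetical helper name, and the field paths in the usage comment (networkProfile.networkPlugin and securityProfile.workloadIdentity.enabled) are what az aks show is assumed to return for your CLI version.

```shell
#!/usr/bin/env bash
# Hypothetical helper: decide whether a cluster meets two of the prerequisites.
# Arguments: <network plugin> <workload identity enabled: true|false>
check_prereqs() {
  local plugin="$1" workload_identity="$2"
  if [ "$plugin" != "azure" ]; then
    echo "unsupported network plugin: ${plugin:-none} (Azure CNI required)"
    return 1
  fi
  if [ "$workload_identity" != "true" ]; then
    echo "workload identity is not enabled"
    return 1
  fi
  echo "ok"
}

# Usage against a live cluster (needs the az CLI and a logged-in session;
# set CLUSTER_NAME and RESOURCE_GROUP to your own values first):
#   plugin=$(az aks show -n "$CLUSTER_NAME" -g "$RESOURCE_GROUP" \
#     --query networkProfile.networkPlugin -o tsv)
#   wi=$(az aks show -n "$CLUSTER_NAME" -g "$RESOURCE_GROUP" \
#     --query securityProfile.workloadIdentity.enabled -o tsv)
#   check_prereqs "$plugin" "$wi"
```

The managed identity's role assignments are not covered here; those still need a manual review.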
The official installation path uses Helm with values pulled from your cluster. Here's a stripped-down install sequence:
# Set environment variables: replace with your actual values
export CLUSTER_NAME="my-aks-cluster"
export RESOURCE_GROUP="my-rg"
export LOCATION="eastus2"
export KARPENTER_NAMESPACE="kube-system"

# Get cluster details needed for Karpenter config
export SUBSCRIPTION_ID=$(az account show --query id -o tsv)
export NODE_RESOURCE_GROUP=$(az aks show \
  --name "${CLUSTER_NAME}" \
  --resource-group "${RESOURCE_GROUP}" \
  --query nodeResourceGroup -o tsv)

# Install via Helm
# NEEDS_VALIDATION: confirm chart version and repo URL against
# https://github.com/Azure/karpenter-provider-azure at time of deployment
helm upgrade --install karpenter oci://mcr.microsoft.com/aks/karpenter/karpenter \
  --namespace "${KARPENTER_NAMESPACE}" \
  --create-namespace \
  --set "settings.clusterName=${CLUSTER_NAME}" \
  --set "settings.location=${LOCATION}" \
  --set "settings.subscriptionID=${SUBSCRIPTION_ID}" \
  --set "settings.resourceGroup=${NODE_RESOURCE_GROUP}" \
  --wait
Defining Your First NodePool
This is where Karpenter's model diverges most from node pools. Instead of pre-creating a pool for every SKU you might want, you define constraints and let Karpenter pick:
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: general
spec:
  template:
    metadata:
      labels:
        workload-type: general
    spec:
      nodeClassRef:
        # karpenter.sh/v1 identifies the node class by API group rather than apiVersion
        group: karpenter.azure.com
        kind: AKSNodeClass
        name: default
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["on-demand", "spot"]
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
        - key: karpenter.azure.com/sku-family
          operator: In
          values: ["D", "E"]
        - key: karpenter.azure.com/sku-version
          operator: Gt
          values: ["3"]
  limits:
    cpu: "200"
    memory: 800Gi
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 30s
---
apiVersion: karpenter.azure.com/v1alpha2
kind: AKSNodeClass
metadata:
  name: default
spec:
  imageFamily: AzureLinux
  osDiskSizeGB: 128
Note on API versions: The karpenter.azure.com API group is versioned separately from upstream Karpenter. v1alpha2 was current as of early 2026, but NEEDS_VALIDATION: check the CRD definitions in the installed chart before you copy this into a GitOps repo.
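One way to do that check is to ask the cluster which versions the installed CRD defines and compare them with the version in your manifest. This is a sketch under assumptions: version_defined is a hypothetical helper, and the CRD name aksnodeclasses.karpenter.azure.com in the usage comment is a guess at the plural form, so confirm it with kubectl get crd first.

```shell
#!/usr/bin/env bash
# Hypothetical helper: succeed if <wanted> appears in a space-separated
# list of API versions defined by a CRD.
version_defined() {
  local wanted="$1" defined="$2" v
  for v in $defined; do
    [ "$v" = "$wanted" ] && return 0
  done
  return 1
}

# Usage against a live cluster. The CRD name below is an assumption;
# list the real one with: kubectl get crd | grep karpenter
#   defined=$(kubectl get crd aksnodeclasses.karpenter.azure.com \
#     -o jsonpath='{.spec.versions[*].name}')
#   version_defined v1alpha2 "$defined" \
#     || echo "v1alpha2 not found; CRD defines: ${defined}"
```

Running this in CI before applying manifests turns a confusing apply-time error into an explicit version mismatch message.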
What Works Well
Spot consolidation is the headline win. When spot VMs get preempted or you have underutilized nodes, Karpenter's consolidation loop handles bin-packing without you writing any automation. With CAS, you're responsible for node pool min/max sizing, and consolidation is coarse.
Multi-SKU scheduling is the other real win. A pod requesting 8 vCPU and 64 GiB RAM will cause Karpenter to search the allowed SKU families for a node that fits, rather than failing because your single pre-configured node pool is exhausted.
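To watch this happen, you can submit a single pod with that request and see which SKU Karpenter picks. The sketch below only prints a manifest; the pod name, the pause image, and the choice to pin the pod to the general NodePool via the karpenter.sh/nodepool label are illustrative assumptions, not something the provider requires.

```shell
#!/usr/bin/env bash
# Print a Pod manifest that requests 8 vCPU and 64 GiB of memory.
# The pod name and the pause image are placeholders chosen for illustration.
render_big_pod() {
  cat <<'EOF'
apiVersion: v1
kind: Pod
metadata:
  name: sku-fit-test
spec:
  nodeSelector:
    karpenter.sh/nodepool: general
  containers:
    - name: pause
      image: registry.k8s.io/pause:3.9
      resources:
        requests:
          cpu: "8"
          memory: 64Gi
EOF
}

# Usage against a live cluster:
#   render_big_pod | kubectl apply -f -
#   kubectl get nodeclaims -w   # watch Karpenter create a node for the pod
#   kubectl delete pod sku-fit-test
```

Because node allocatable memory is lower than the VM's nominal memory, expect Karpenter to choose a SKU somewhat larger than the raw request suggests.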
GPU node provisioning works, including time-slicing scenarios, as long as your AKSNodeClass uses an image family that ships the NVIDIA drivers. AzureLinux with GPU extensions does this. You still need to manage the device plugin separately.
What Is Still Rough
Node provisioning latency is higher than AWS because Azure VM creation is slower than EC2. Plan for 3–5 minutes from pod pending to node ready on cold starts. This isn't a Karpenter problem per se, but it affects how you design your buffer capacity and PodDisruptionBudgets.
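One common way to build that buffer is placeholder-pod overprovisioning: low-priority pause pods hold spare capacity, and the scheduler evicts them the moment real workloads need the room. This is a general Kubernetes pattern rather than anything specific to this provider. In the sketch below, render_buffer, the capacity-buffer name, the priority value of -10, and the default sizes are all illustrative assumptions.

```shell
#!/usr/bin/env bash
# Print a PriorityClass and Deployment of low-priority placeholder pods.
# Arguments (all optional): <replicas> <cpu per pod> <memory per pod>
render_buffer() {
  local replicas="${1:-2}" cpu="${2:-2}" memory="${3:-4Gi}"
  cat <<EOF
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: capacity-buffer
value: -10
globalDefault: false
description: "Placeholder pods that real workloads may preempt"
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: capacity-buffer
spec:
  replicas: ${replicas}
  selector:
    matchLabels:
      app: capacity-buffer
  template:
    metadata:
      labels:
        app: capacity-buffer
    spec:
      priorityClassName: capacity-buffer
      terminationGracePeriodSeconds: 0
      containers:
        - name: pause
          image: registry.k8s.io/pause:3.9
          resources:
            requests:
              cpu: "${cpu}"
              memory: ${memory}
EOF
}

# Usage against a live cluster: hold three 2-vCPU / 4 GiB slots warm.
#   render_buffer 3 2 4Gi | kubectl apply -f -
```

The trade-off is cost: the buffer nodes are billed whether or not anything real lands on them, so size the buffer to your largest expected burst rather than to peak load.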
Windows node pools are not supported by the Azure Karpenter provider. If you have Windows workloads, keep a static node pool managed by CAS or manual scaling.
Custom VNet/subnet selection requires care. The AKSNodeClass lets you specify subnet IDs, but if you're using private clusters with complex network topologies, test thoroughly before rolling to production. Subnet exhaustion errors surface late and are annoying to debug.
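A little arithmetic up front makes exhaustion less of a surprise. Azure reserves five addresses in every subnet, so a /24 leaves 251 usable. The sketch below assumes one subnet IP per node, which matches overlay mode; the function names are hypothetical, and the in-use count is something you supply yourself.

```shell
#!/usr/bin/env bash
# Usable addresses in an Azure subnet of the given prefix length.
# Azure reserves 5 addresses in every subnet.
subnet_usable_ips() {
  local prefix_len="$1"
  echo $(( (1 << (32 - prefix_len)) - 5 ))
}

# Hypothetical helper: how many more nodes fit, assuming one IP per node.
# With non-overlay Azure CNI, pods also draw from the subnet, so treat the
# result as an upper bound.
subnet_node_headroom() {
  local prefix_len="$1" ips_in_use="$2"
  echo $(( $(subnet_usable_ips "$prefix_len") - ips_in_use ))
}

subnet_usable_ips 24         # prints 251
subnet_node_headroom 24 200  # prints 51

# The prefix length can be read from the subnet resource, for example:
#   az network vnet subnet show --ids "$SUBNET_ID" --query addressPrefix -o tsv
```

Comparing the headroom against the CPU limit on your NodePool tells you whether the subnet or the limit will be hit first.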
Observability is immature. Karpenter emits metrics to Prometheus and logs to stdout, but the AKS provider's specific actions (VM creation, NIC attachment) aren't surfaced as well as you'd want. You'll be reading kubectl logs more than you'd like.
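Since you will be in the logs anyway, a small filter helps. This sketch assumes Karpenter writes one JSON object per line with a level field, and matches it case-insensitively; the label selector in the usage comment is a guess based on common Helm chart labels, so check it with kubectl get pods --show-labels.

```shell
#!/usr/bin/env bash
# Keep only error-level lines from Karpenter's JSON logs read on stdin.
# Assumes each line carries a "level" field; matching is case-insensitive.
# Returns success even when nothing matches, so it is safe in pipelines.
karpenter_errors() {
  grep -i '"level":"error"' || true
}

# Usage against a live cluster (label selector is an assumption):
#   kubectl logs -n "$KARPENTER_NAMESPACE" -l app.kubernetes.io/name=karpenter \
#     --tail=500 | karpenter_errors
```

Passing --tail explicitly matters here because kubectl limits output to a handful of lines per pod when a label selector is used.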
Concrete Steps to a Safe Rollout
1. Start with a non-production cluster. Run Karpenter alongside CAS, not instead of it. CAS can manage your system node pool; Karpenter handles a separate workload NodePool.
2. Define limits on your NodePool. Without a CPU/memory ceiling, a scheduling bug or runaway HPA can provision hundreds of nodes before you notice. Set limits conservatively and raise them deliberately.
3. Set consolidateAfter to something sane for your workload. 30 seconds is aggressive for stateful apps. Use 5–10 minutes for anything with slow startup or persistent volumes.
4. Test spot preemption handling. Deploy a test workload on spot nodes and manually deallocate a VM. Verify that Karpenter reprovisions within your acceptable window and that your pod disruption budgets hold.
5. Add Karpenter's node labels to your monitoring dashboards. Specifically, track karpenter.sh/capacity-type, karpenter.azure.com/sku-name, and karpenter.sh/nodepool as label dimensions so you can see cost and performance breakdowns by node type.
6. Pin your Helm chart version in GitOps. The provider is still in active development. Uncontrolled upgrades have broken NodePool CRD schemas between minor versions. Treat upgrades as a planned event.
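Until the dashboards exist, the same label breakdown is available from the command line. The sketch below counts nodes per capacity type and SKU; summarize_nodes is a hypothetical helper that reads plain text on stdin, and the usage comment assumes kubectl's custom-columns output with escaped dots in the label keys.

```shell
#!/usr/bin/env bash
# Count nodes per (capacity type, SKU) from "<name> <capacity-type> <sku>"
# lines on stdin. Prints "<count> <capacity-type> <sku>", sorted by type and SKU.
summarize_nodes() {
  awk 'NF >= 3 { n[$2 " " $3]++ } END { for (k in n) print n[k], k }' \
    | sort -k2,3
}

# Usage against a live cluster (only nodes carrying the nodepool label):
#   kubectl get nodes -l karpenter.sh/nodepool --no-headers \
#     -o custom-columns='NAME:.metadata.name,CAP:.metadata.labels.karpenter\.sh/capacity-type,SKU:.metadata.labels.karpenter\.azure\.com/sku-name' \
#     | summarize_nodes
```

For a quick look without any parsing, kubectl get nodes -L karpenter.sh/capacity-type,karpenter.azure.com/sku-name,karpenter.sh/nodepool adds the three labels as extra columns.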
CAS vs. Karpenter: The Honest Comparison
| Concern | CAS | Karpenter |
|---|---|---|
| Operational maturity on AKS | Production-grade | Production-capable with caveats |
| Multi-SKU bin-packing | Requires pre-defined pools | Native |
| Spot handling | Decent | Better |
| Windows nodes | Yes | No |
| Debug tooling | Mature | Developing |
| Azure support | First-party | Community + Microsoft OSS |
Final Thoughts
Karpenter on AKS in 2026 is the right choice if you have heterogeneous compute requirements and engineering capacity to own the operational model. It is not yet the "set it and forget it" experience that CAS is for straightforward clusters.
The Azure team has been shipping at a reasonable pace and the GitHub issues backlog is actually getting shorter, which is a good sign. The API is stabilizing. The path from alpha to beta to stable is visible.
Just don't copy that YAML into production without validating the API versions first. I warned you.