Disruption budgets were introduced in Karpenter 0.36, and they look like a very useful tool for limiting when Karpenter is allowed to recreate WorkerNodes.
For example, in my case we don't want EC2 instances to be killed during US business hours, because we have customers there, so we currently run with consolidationPolicy: WhenEmpty
to prevent "unnecessary" deletion of servers and the Pods on them.
Instead, with Disruption budgets we can configure policies so that operations with WhenEmpty are allowed in one period of time, and WhenEmptyOrUnderutilized in another.
See also Kubernetes: ensuring High Availability for Pods, because when using Karpenter, even with Disruption budgets configured, you still need your Pods to have Topology Spread Constraints and a PodDisruptionBudget configured accordingly.
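For example, a minimal PodDisruptionBudget sketch for a hypothetical backend-api Deployment (the name and labels here are just placeholders):
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: backend-api
spec:
  minAvailable: 1          # keep at least one Pod running during voluntary disruptions
  selector:
    matchLabels:
      app: backend-api     # must match the labels of the Pods to protect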
Karpenter Disruption types
Documentation — Automated Graceful Methods.
First, let’s see in which cases Disruption occurs at all:
- Drift: occurs when there is a difference between the created NodePool or EC2NodeClass configurations and the existing WorkerNodes; Karpenter will then start recreating the EC2 instances to bring them in line with the specified parameters
- Interruption: occurs when Karpenter receives an AWS event that an instance is about to be terminated, for example if it is a Spot Instance
- Consolidation: occurs if we have Consolidation set to WhenEmptyOrUnderutilized or WhenEmpty, and Karpenter moves our Pods to other WorkerNodes. Note: we have Karpenter v1.0, so the policy is called WhenEmptyOrUnderutilized; for v0.36 or v0.37 it's WhenUnderutilized
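A quick way to check what Karpenter thinks about a particular node is to look at its NodeClaim, the Karpenter v1 resource that backs each WorkerNode (the claim name below is a placeholder):
$ kubectl get nodeclaims
$ kubectl describe nodeclaim backend1a-abc12
In the describe output, the Status conditions should show why a node is a disruption candidate, for example a Drifted condition.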
Karpenter Disruption Budgets
With Disruption budgets, we can very flexibly configure which operations Karpenter is allowed to perform and when, and set a limit on how many WorkerNodes can be deleted at the same time.
Documentation — NodePool Disruption Budgets.
The configuration format is quite simple:
budgets:
  - nodes: "20%"
    reasons:
      - "Empty"
    schedule: "@daily"
    duration: 10m
Here we set:
- allow deletion of up to 20% of the total number of WorkerNodes
- for operations where the Disruption is triggered by the WhenEmpty condition
- every day
- for 10 minutes
The parameters here can take the following values:
- nodes: a percentage or a number of nodes
- reasons: Drifted, Underutilized, or Empty
- schedule: the schedule by which the rule is applied, in UTC (other time zones are not supported yet), see Kubernetes Schedule syntax
- duration: how long the rule stays in effect, for example 1h15m
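For example, a budget that uses all four fields at once might look like this (the values are purely illustrative):
budgets:
  - nodes: "2"                    # no more than 2 nodes disrupted at a time
    reasons:
      - "Drifted"
    schedule: "0 22 * * mon-fri"  # every weekday at 22:00 UTC
    duration: 1h15m               # the rule stays active for 1 hour 15 minutes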
Also, it is not necessary to set all the parameters.
For example, we can describe two such budgets:
- nodes: "25%"
- nodes: "10"
Then both rules will be in effect all the time: the first limits disruptions to 25% of the total number of nodes, and the second to no more than 10 instances at a time; the second becomes the effective limit once we have more than 40 servers.
Budgets can also be combined: if several of them are active at the same time, the most restrictive limit is applied.
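For illustration, assuming a cluster of 60 WorkerNodes, the two budgets above would combine like this:
"25%" budget: 0.25 * 60 = 15 nodes may be disrupted at once
"10" budget:  10 nodes may be disrupted at once
effective limit: min(15, 10) = 10 nodes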
In the first example, we apply the rule for 20% of nodes and the WhenEmpty condition, and the rest of the time the default disruption rules will work, that is, 10% of the total number of servers with the specified consolidationPolicy.
Therefore, we can write the rule as follows:
budgets:
  - nodes: "20%"
    reasons:
      - "Empty"
    schedule: "@daily"
    duration: 10m
  - nodes: "0"
Here, the last rule works all the time and acts as a kind of fuse: we prohibit everything, but allow disruptions to be executed according to the WhenEmpty policy for 10 minutes once a day, starting from 00:00 UTC.
Disruption Budgets example
Going back to my task:
- we have a Backend API in Kubernetes on a dedicated NodePool, and our customers are mostly from the USA, so we want to minimize the down-scaling of WorkerNodes during US business hours
- to do this, we want to block all WhenUnderutilized operations during working hours in US Central Time
- Karpenter's schedule uses the UTC zone, so the start of the working day in US Central Time, 9:00, is 15:00 UTC (see the note right after this list)
- operations with WhenEmpty are allowed at any time, but only 1 WorkerNode at a time
- Drift: similarly allowed at any time, because when I deploy changes, I want to see the result immediately
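The conversion itself (note this assumes Central Standard Time; during daylight saving time Central is UTC-5, so 9:00 would be 14:00 UTC):
09:00 US Central Standard Time (UTC-6) = 15:00 UTC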
So, in fact, we need to set two budgets:
- for Underutilized: prohibit everything from Monday to Friday for 9 hours, starting from 15:00 UTC
- for Empty and Drifted: allow at any time, but only 1 node at a time instead of the default 10%
Then our NodePool will look like this:
apiVersion: karpenter.sh/v1
kind: NodePool
metadata:
  name: backend1a
spec:
  template:
    metadata:
      labels:
        created-by: karpenter
        component: devops
    spec:
      taints:
        - key: BackendOnly
          operator: Exists
          effect: NoSchedule
      nodeClassRef:
        group: karpenter.k8s.aws
        kind: EC2NodeClass
        name: defaultv1a
      requirements:
        - key: karpenter.k8s.aws/instance-family
          operator: In
          values: ["c5"]
        - key: karpenter.k8s.aws/instance-size
          operator: In
          values: ["large", "xlarge"]
        - key: topology.kubernetes.io/zone
          operator: In
          values: ["us-east-1a"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
  # total cluster limits
  limits:
    cpu: 1000
    memory: 1000Gi
  disruption:
    consolidationPolicy: WhenEmptyOrUnderutilized
    consolidateAfter: 600s
    budgets:
      - nodes: "0" # block all
        reasons:
          - "Underutilized" # if reason == underutilized
        schedule: "0 15 * * mon-fri" # starting at 15:00 UTC during weekdays
        duration: 9h # during 9 hours
      - nodes: "1" # allow by 1 WorkerNode at a time
        reasons:
          - "Empty"
          - "Drifted"
Deploy it and check the NodePool:
$ kk describe nodepool backend1a
Name:         backend1a
...
API Version:  karpenter.sh/v1
Kind:         NodePool
...
Spec:
  Disruption:
    Budgets:
      Duration:  9h
      Nodes:     0
      Reasons:
        Underutilized
      Schedule:  0 15 * * mon-fri
      Nodes:     1
      Reasons:
        Empty
        Drifted
    Consolidate After:     600s
    Consolidation Policy:  WhenEmptyOrUnderutilized
...
And we can see in Karpenter's logs that a Disruption was triggered by WhenUnderutilized:
karpenter-55b845dd4c-tlrdr:controller {"level":"INFO","time":"2024-09-16T10:48:26.777Z","logger":"controller","message":"disrupting nodeclaim(s) via delete, terminating 1 nodes (2 pods) ip-10-0-42-250.ec2.internal/t3.small/spot","commit":"62a726c","controller":"disruption","namespace":"","name":"","reconcileID":"db2233c3-c64b-41f2-a656-d6a5addeda8a","command-id":"1cd3a8d8-57e9-4107-a701-bd167ed23686","reason":"underutilized"}
karpenter-55b845dd4c-tlrdr:controller {"level":"INFO","time":"2024-09-16T10:48:27.016Z","logger":"controller","message":"tainted node","commit":"62a726c","controller":"node.termination","controllerGroup":"","controllerKind":"Node","Node":{"name":"ip-10-0-42-250.ec2.internal"},"namespace":"","name":"ip-10-0-42-250.ec2.internal","reconcileID":"f0815e43-94fb-4546-9663-377441677028","taint.Key":"karpenter.sh/disrupted","taint.Value":"","taint.Effect":"NoSchedule"}
karpenter-55b845dd4c-tlrdr:controller {"level":"INFO","time":"2024-09-16T10:50:35.212Z","logger":"controller","message":"deleted node","commit":"62a726c","controller":"node.termination","controllerGroup":"","controllerKind":"Node","Node":{"name":"ip-10-0-42-250.ec2.internal"},"namespace":"","name":"ip-10-0-42-250.ec2.internal","reconcileID":"208e5ff7-8371-442a-9c02-919e3525001b"}
Done.
Originally published at RTFM: Linux, DevOps, and system administration.