Hamdi (KHELIL) LION

Posted on Jun 26

🧹 Keeping Your Kubernetes Cluster Clean with the Descheduler

#kubernetes #cloud #devops #containers

Your scheduler did a great job placing pods this morning. But your cluster never stops moving, and by this afternoon those decisions are already a little bit wrong. 😅

The kube-scheduler only decides things once, at pod creation time. After that it walks away. The descheduler is the friend that comes back later, looks at the mess, and gently tidies up.

Let me walk you through it. 🚀

🗺️ what we will cover

🤔 why a perfectly scheduled cluster slowly drifts
🧹 what the descheduler does (and what it does not do)
🧩 the building blocks: profiles, plugins and extension points
🔌 the strategy plugins you will actually reach for
🛡️ the Default Evictor, your safety net
🚀 installing it with Helm (full values file)
⏱️ CronJob vs Deployment mode
🧪 testing safely with dry run
🧯 a production safety checklist
🧠 the gotcha that bites everyone

🤔 why your cluster drifts

The scheduler makes a one time decision based on the cluster as it looked the moment a pod appeared. Real clusters keep changing under that decision:

🆕 New nodes join (autoscaler or manual) and sit half empty while old nodes stay packed.
🔧 You drain and uncordon a node for maintenance, and nothing moves back onto it.
🏷️ Labels and taints change, so pods that matched the old rules now violate them.
💥 Nodes fail, pods pile up elsewhere, and the spread never recovers on its own.

The result is drift: hotspots, lopsided utilization, and affinity rules that are quietly broken.

👉 The descheduler exists to correct this drift, on a schedule, without you babysitting it.

🧹 what it actually does

Here is the part people get wrong, so let me be blunt about it.

The descheduler does not schedule pods. It only evicts them. 🙅

It finds pods that are poorly placed and evicts them through the standard Kubernetes Eviction API. The owning controller (a Deployment's ReplicaSet, for example) then creates replacements, and the normal kube-scheduler places those in a better spot.

That has two big consequences:

✅ It plays nicely with the scheduler you already have, no replacement needed.
✅ It respects PodDisruptionBudgets, because it uses the same eviction path everything else does. If an eviction would break a PDB, the request is rejected and that pod is skipped.

So the mental model is simple: the descheduler makes room, the scheduler fills it. 🔁
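
To make the PDB point concrete, here is a minimal sketch of a budget the descheduler would have to respect. The name web-pdb, the app: web label and the minAvailable value are illustrative assumptions, not part of any setup described here.

```yaml
# Hypothetical PodDisruptionBudget for a workload labelled app=web.
# An eviction that would leave fewer than 2 healthy matching pods
# is rejected by the API server, whoever asked for it.
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: web-pdb        # illustrative name
  namespace: default
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: web         # illustrative label
```

With three healthy replicas behind this budget, only one pod can be evicted at a time. The next eviction is refused until a replacement is ready.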

🧩 the building blocks

Modern descheduler config uses the descheduler/v1alpha2 API. Three concepts matter.

1. Profiles 📋
A profile is a named bundle of plugins and their config. You can run more than one.

2. Plugins 🔌
Each strategy is a plugin (for example LowNodeUtilization). You configure it under pluginConfig and then switch it on under plugins.

3. Extension points 🪝
This is where a plugin runs. The two you care about for strategies are:

deschedule: looks at pods one by one and evicts the bad ones.
balance: looks across nodes and evicts to even things out.

There are also filter and preEvictionFilter points, which the Default Evictor uses to decide what is safe to touch.

Here is the shape of a policy so the pieces click:

apiVersion: "descheduler/v1alpha2"
kind: "DeschedulerPolicy"
profiles:
  - name: default
    pluginConfig:
      - name: "LowNodeUtilization"
        args:
          thresholds:
            cpu: 20
            memory: 20
            pods: 20
          targetThresholds:
            cpu: 50
            memory: 50
            pods: 50
    plugins:
      balance:
        enabled:
          - "LowNodeUtilization"
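
Since a policy can hold more than one profile, here is a rough sketch of how two of them could sit side by side, one per extension point. The profile names fix-violations and rebalance are made up for illustration, and the plugin choice is just an example.

```yaml
apiVersion: "descheduler/v1alpha2"
kind: "DeschedulerPolicy"
profiles:
  # Profile 1: per-pod checks, wired to the deschedule extension point
  - name: fix-violations          # hypothetical profile name
    pluginConfig:
      - name: "RemovePodsViolatingNodeTaints"
    plugins:
      deschedule:
        enabled:
          - "RemovePodsViolatingNodeTaints"
  # Profile 2: cross-node rebalancing, wired to the balance extension point
  - name: rebalance               # hypothetical profile name
    pluginConfig:
      - name: "RemoveDuplicates"
    plugins:
      balance:
        enabled:
          - "RemoveDuplicates"
```

Note how each plugin is enabled under the extension point it implements: enabling a balance plugin under deschedule (or the reverse) is a common configuration mistake.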

🔌 the strategies you will actually use

There are many plugins, but a handful do most of the work. Let me group them by extension point.

⚖️ balance plugins

These rebalance placement across the cluster.

🔥 LowNodeUtilization: finds underused nodes and evicts pods from overused ones, hoping they reschedule onto the quiet nodes. Note: utilization here is based on pod requests vs allocatable, not live metrics, unless you wire up a metrics provider.
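🌶️ HighNodeUtilization: the opposite idea. It evicts pods from underused nodes so they get packed onto fewer nodes and the emptied ones can be scaled down. Great with a cluster autoscaler that scales in, and it only makes sense when the scheduler itself is set up to pack pods (the MostAllocated scoring strategy).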
👯 RemoveDuplicates: stops multiple pods of the same workload from piling onto one node, which is exactly what you want for high availability.
🗺️ RemovePodsViolatingTopologySpreadConstraint: re-balances pods so they respect your topology spread constraints again.
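
As a sketch of how two of these balance plugins might be configured together, assuming a cluster that wants bin packing plus strict topology spread. The profile name and the threshold percentages are illustrative, not recommendations.

```yaml
apiVersion: "descheduler/v1alpha2"
kind: "DeschedulerPolicy"
profiles:
  - name: bin-pack                # hypothetical profile name
    pluginConfig:
      - name: "HighNodeUtilization"
        args:
          # a node below ALL of these request percentages counts as underused
          thresholds:
            cpu: 20
            memory: 20
      - name: "RemovePodsViolatingTopologySpreadConstraint"
        args:
          # only act on hard constraints, leave ScheduleAnyway ones alone
          constraints:
            - DoNotSchedule
    plugins:
      balance:
        enabled:
          - "HighNodeUtilization"
          - "RemovePodsViolatingTopologySpreadConstraint"
```

HighNodeUtilization and LowNodeUtilization pull in opposite directions, so it is usually wise to enable only one of them in a given cluster.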

🧨 deschedule plugins

These walk pods and evict the ones that no longer belong.

🚫 RemovePodsViolatingNodeAffinity: evicts pods stuck on nodes that no longer match their node affinity.
🧪 RemovePodsViolatingNodeTaints: evicts pods that no longer tolerate a node's taints.
🧲 RemovePodsViolatingInterPodAntiAffinity: cleans up pods that now break their anti affinity rules.
♻️ RemovePodsHavingTooManyRestarts: evicts crash looping pods past a restart threshold so they can try a healthier node.
⏳ PodLifeTime: evicts pods older than a max lifetime, handy for forcing periodic recycling.
🧟 RemoveFailedPods: clears out pods stuck in a failed state.
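
Here is a minimal sketch of what a few of these deschedule plugins could look like in a policy. The profile name and the numeric values (a 7 day lifetime, a 1 hour grace period for failed pods) are assumptions picked for illustration.

```yaml
apiVersion: "descheduler/v1alpha2"
kind: "DeschedulerPolicy"
profiles:
  - name: cleanup                 # hypothetical profile name
    pluginConfig:
      - name: "RemovePodsViolatingInterPodAntiAffinity"
      - name: "PodLifeTime"
        args:
          maxPodLifeTimeSeconds: 604800   # 7 days, illustrative
      - name: "RemoveFailedPods"
        args:
          minPodLifetimeSeconds: 3600     # ignore pods that failed less than 1h ago
    plugins:
      deschedule:
        enabled:
          - "RemovePodsViolatingInterPodAntiAffinity"
          - "PodLifeTime"
          - "RemoveFailedPods"
```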

👉 Start with RemoveDuplicates and LowNodeUtilization. They give the most value with the least surprise.

🛡️ the Default Evictor, your safety net

Before any strategy evicts a pod, the Default Evictor decides whether that pod is even allowed to be touched. This is your most important safety layer, so do not skip it.

The args you will reach for:

🧷 nodeFit: true: only evict a pod if there is actually another node it could land on. This single setting prevents most pointless eviction loops.
🔢 minReplicas: never evict if fewer than this many replicas exist, so you do not knock out a lonely pod.
👑 evictSystemCriticalPods: false: leave system critical priority pods alone (this is the default).
💾 evictLocalStoragePods: false: do not evict pods using local storage, unless you really mean to.
🐝 evictDaemonSetPods: false: leave DaemonSet pods in place (this is the default).

apiVersion: "descheduler/v1alpha2"
kind: "DeschedulerPolicy"
profiles:
  - name: default
    pluginConfig:
      - name: "DefaultEvictor"
        args:
          nodeFit: true
          minReplicas: 2
          evictSystemCriticalPods: false
          evictLocalStoragePods: false
          evictDaemonSetPods: false

Newer releases also offer a podProtections block (with defaultDisabled and extraEnabled lists, plus extras like PodsWithPVC and PodsWithoutPDB). The classic args above still work and are easier to read when you are getting started.
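
For reference, a rough sketch of what the podProtections form might look like. The exact nesting under the DefaultEvictor args is my assumption based on the field names mentioned above, so check the README of the release you run before copying it.

```yaml
apiVersion: "descheduler/v1alpha2"
kind: "DeschedulerPolicy"
profiles:
  - name: default
    pluginConfig:
      - name: "DefaultEvictor"
        args:
          nodeFit: true
          minReplicas: 2
          # assumed placement of the block; verify against your release
          podProtections:
            extraEnabled:
              - "PodsWithPVC"       # also protect pods that mount a PVC
              - "PodsWithoutPDB"    # also protect pods not covered by a PDB
```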

🚀 installing it with Helm

The official Helm chart is the cleanest path. These commands are for the chart published by the project.

helm repo add descheduler https://kubernetes-sigs.github.io/descheduler/
helm repo update
helm install descheduler descheduler/descheduler --namespace kube-system

Now here is a complete, opinionated values.yaml you can actually start from. It runs as a CronJob, enables the safe high value strategies, and keeps the Default Evictor strict.

# values.yaml
kind: CronJob
schedule: "*/15 * * * *"   # run every 15 minutes, tune to taste

# pin the image to a known version
image:
  repository: registry.k8s.io/descheduler/descheduler
  tag: v0.36.0

deschedulerPolicyAPIVersion: "descheduler/v1alpha2"

deschedulerPolicy:
  # cluster wide eviction guard rails
  maxNoOfPodsToEvictPerNode: 5
  maxNoOfPodsToEvictPerNamespace: 10
  maxNoOfPodsToEvictTotal: 50

  profiles:
    - name: default
      pluginConfig:
        - name: "DefaultEvictor"
          args:
            nodeFit: true
            minReplicas: 2
            evictSystemCriticalPods: false
            evictLocalStoragePods: false
            evictDaemonSetPods: false

        - name: "RemoveDuplicates"
          args:
            excludeOwnerKinds:
              - "DaemonSet"

        - name: "LowNodeUtilization"
          args:
            thresholds:
              cpu: 20
              memory: 20
              pods: 20
            targetThresholds:
              cpu: 50
              memory: 50
              pods: 50

        - name: "RemovePodsHavingTooManyRestarts"
          args:
            podRestartThreshold: 100
            includingInitContainers: true

      plugins:
        balance:
          enabled:
            - "RemoveDuplicates"
            - "LowNodeUtilization"
        deschedule:
          enabled:
            - "RemovePodsHavingTooManyRestarts"

Apply it with:

helm upgrade --install descheduler descheduler/descheduler \
  --namespace kube-system \
  --values values.yaml

Prefer Kustomize? The project ships base manifests too. Swap the ref for the release branch that matches your version (for example release-1.36 for v0.36.0):

kustomize build 'github.com/kubernetes-sigs/descheduler/kubernetes/cronjob?ref=release-1.36' | kubectl apply -f -

⏱️ CronJob vs Deployment mode

The descheduler can run in three shapes. Two matter day to day.

🕒 CronJob: runs on a schedule (schedule: "*/15 * * * *") and exits. Simple, cheap, easy to reason about. Great default.
♾️ Deployment: runs continuously and re-evaluates every deschedulingInterval (for example 5m). Pair it with leader election if you run more than one replica.

# Deployment mode snippet
kind: Deployment
replicas: 1
deschedulingInterval: 5m
leaderElection:
  enabled: true

👉 If you are not sure, start with CronJob. You only move to Deployment when you want tighter, continuous reaction.

🧪 test safely with dry run first

Please do not point this at production blind. Run it in dry run mode and read the logs first. 🙏

With Helm:

helm upgrade --install descheduler descheduler/descheduler \
  --namespace kube-system \
  --values values.yaml \
  --set cmdOptions.dry-run=true

Then watch what it would have done:

kubectl -n kube-system logs -l app.kubernetes.io/name=descheduler -f

You will see lines naming the pods it would evict and the strategy that picked them. Tune your thresholds until the list looks sane, then turn dry run off.
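
If you would rather keep the flag in your values file than on the command line, a sketch along these lines should be equivalent, assuming the chart turns each cmdOptions key into a command line flag the same way the --set form does. The verbosity level is my own illustrative choice.

```yaml
# values.yaml fragment (sketch)
cmdOptions:
  dry-run: true   # same effect as --set cmdOptions.dry-run=true
  v: 4            # log verbosity, illustrative value
```

Keeping it in the file makes the dry run state visible in code review, and turning it off becomes a one line diff.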

🧯 production safety checklist

Before you let it evict for real, walk this list. ✅

✅ PodDisruptionBudgets everywhere that matters. The descheduler honors them, so a good PDB is your strongest guard against an outage.
✅ nodeFit: true so it never evicts a pod that has nowhere else to go.
✅ minReplicas set so single replica workloads are left alone.
✅ Eviction caps via maxNoOfPodsToEvictPerNode, maxNoOfPodsToEvictPerNamespace and maxNoOfPodsToEvictTotal so a bad run cannot churn the whole cluster.
✅ System namespaces such as kube-system excluded through the namespace filter of each plugin that supports one, and system critical priority pods left untouched.
✅ Sensible schedule. Every few minutes is rarely needed. Every 15 to 30 minutes is plenty for most clusters.
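
As a sketch of the namespace item on this list, here is how kube-system might be excluded for the two starter strategies. Note that the argument name differs between plugins (namespaces for RemoveDuplicates, evictableNamespaces for LowNodeUtilization), so treat this as a starting point and confirm the names for any other plugin you enable.

```yaml
apiVersion: "descheduler/v1alpha2"
kind: "DeschedulerPolicy"
profiles:
  - name: default
    pluginConfig:
      - name: "RemoveDuplicates"
        args:
          namespaces:
            exclude:
              - "kube-system"
      - name: "LowNodeUtilization"
        args:
          thresholds:
            cpu: 20
            memory: 20
            pods: 20
          targetThresholds:
            cpu: 50
            memory: 50
            pods: 50
          evictableNamespaces:
            exclude:
              - "kube-system"
    plugins:
      balance:
        enabled:
          - "RemoveDuplicates"
          - "LowNodeUtilization"
```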

📊 keep an eye on it

The descheduler exposes Prometheus metrics, served on https://localhost:10258/metrics by default. You can change the address with the --binding-address and --secure-port flags.

Scrape it, then watch how many pods it evicts per run. A healthy cluster should settle into a small, steady number. A number that never drops is a signal that something keeps fighting it. ⚠️
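
If you run the Prometheus Operator, something like the fragment below may be all you need. It assumes Deployment mode (a CronJob pod is too short lived to scrape reliably) and assumes the chart exposes service and serviceMonitor toggles under these keys, so check the chart's own values.yaml before relying on it.

```yaml
# values.yaml fragment (sketch, keys are assumptions to verify)
kind: Deployment
service:
  enabled: true        # expose the metrics port through a Service
serviceMonitor:
  enabled: true        # let the Prometheus Operator discover it
  interval: 30s        # illustrative scrape interval
```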

🧠 the gotcha that bites everyone

Here is the classic trap: an eviction and reschedule loop.

It usually looks like this:

1. You enable a strategy (say node affinity based) to nudge pods toward preferred nodes.
2. Those preferred nodes are not available yet.
3. The descheduler evicts the pod, the scheduler puts it right back where it was, and the cycle repeats forever. 🔄

This shows up a lot with spot or autoscaled node pools, where the target nodes are still being provisioned when the eviction happens.

How to avoid the loop:

🧷 Keep nodeFit: true so it will not evict when there is no valid destination.
🐢 Use a calmer schedule so the cluster has time to settle between runs.
🎯 Be careful with preferred affinity rules as eviction triggers. A preferred rule is only a scoring hint, so the scheduler is free to put the pod straight back where it was and the churn never ends.
🚦 Cap evictions so even a misconfiguration stays contained.
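
Put together, a loop resistant node affinity setup could look roughly like this. It is a sketch under the assumption that you only want to act on hard (required) affinity. The profile name and the cap of 5 are illustrative.

```yaml
apiVersion: "descheduler/v1alpha2"
kind: "DeschedulerPolicy"
maxNoOfPodsToEvictPerNode: 5      # contain the damage if something is misconfigured
profiles:
  - name: affinity-required-only  # hypothetical profile name
    pluginConfig:
      - name: "DefaultEvictor"
        args:
          nodeFit: true           # skip pods with no valid destination
      - name: "RemovePodsViolatingNodeAffinity"
        args:
          # act on required rules only, not on preferred ones
          nodeAffinityType:
            - "requiredDuringSchedulingIgnoredDuringExecution"
    plugins:
      deschedule:
        enabled:
          - "RemovePodsViolatingNodeAffinity"
```

Leaving the preferred affinity type out of nodeAffinityType is the design choice that matters here: a pod is only evicted when it sits on a node it is genuinely not allowed to be on.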

👉 If you see the same pods evicted over and over in dry run, fix that before going live. Dry run is exactly how you catch this.

🎁 wrapping up

The descheduler is one of those tools that quietly earns its keep. It does not replace your scheduler, it just keeps cleaning up after entropy:

🧹 It evicts poorly placed pods and lets the scheduler re-place them.
🛡️ It respects PDBs and the Default Evictor so it stays polite.
🚀 It installs in minutes with Helm and tunes with a single values file.
🧪 It is safe to trial thanks to dry run.

Start small. Turn on RemoveDuplicates and LowNodeUtilization, run in dry run, read the logs, then let it loose. Your nodes will thank you. 😄

Happy clustering and stay safe! 🧹🚀
