In my previous post, I covered how to scale compute (worker) nodes in OpenShift using a semi-automated approach. While OpenShift handled most of the heavy lifting, such as powering on the node via BMC, installing RHCOS, and joining the cluster, the scaling action itself still required manual intervention.
This document focuses on removing that manual step altogether. Specifically, it explores how Cluster Autoscaler and Machine Autoscaler work together to enable automatic, workload-driven scaling of compute nodes. While technologies like Cluster API and Machine API provide the underlying framework for managing machine lifecycles, the real decision-making around when to scale happens at the autoscaler layer.
Before I dive in, I would like to spend a line on each of the building blocks (so you don't think of me as a complete idiot!).
Cluster Autoscaler - observes the state of workloads in the cluster. It continuously monitors pending and unschedulable pods and determines whether adding (or removing) nodes would help satisfy resource requirements.
Machine Autoscaler - acts as the bridge between high-level scaling decisions and infrastructure changes.
MachineSets - the equivalent of a ReplicaSet, but for nodes instead of pods; they define how a worker node should be created.
Cluster API - a Kubernetes project that provides a declarative way to create, manage, and scale Kubernetes clusters.
Machine API - Red Hat's opinionated implementation of Cluster API concepts.
The way these different components work together is:
- The Cluster Autoscaler watches for unschedulable pods.
- It informs the Machine Autoscaler, which updates the MachineSet limits. Simply speaking, it runs the equivalent of "oc scale machineset/<name> --replicas=2".
- Our hardworking employee, the Machine API, prepares the new node (of course, with help from the BMH).
- Pods get scheduled on the new worker node.
We will now walk through this teeny-tiny set of 4 steps in OpenShift.
Note: You can switch to the "openshift-machine-api" namespace, as most of the steps happen there:
oc project openshift-machine-api
Let's check our current nodes, machinesets and BMH (that's BareMetalHosts and not Jasprit Bumrah!)
oc get nodes
NAME      STATUS   ROLES                  AGE    VERSION
master    Ready    control-plane,master   8d     v1.27.10+28ed2d7
master2   Ready    control-plane,master   8d     v1.27.10+28ed2d7
master3   Ready    control-plane,master   8d     v1.27.10+28ed2d7
worker1   Ready    worker                 3d7h   v1.27.10+28ed2d7
oc get machinesets
NAME                       DESIRED   CURRENT   READY   AVAILABLE   AGE
mycluster-7ln8n-worker-0   1         1         1       1           8d
oc get bmh
NAME      STATE                    CONSUMER                         ONLINE   ERROR   AGE
master    externally provisioned   mycluster-7ln8n-master-0         true             8d
master2   externally provisioned   mycluster-7ln8n-master-1         true             8d
master3   externally provisioned   mycluster-7ln8n-master-2         true             8d
worker1   provisioned              mycluster-7ln8n-worker-0-qv4cn   true             3d10h
Let's create our ClusterAutoscaler and MachineAutoscaler manifests.
oc apply -f machineautoscaler.yaml
machineautoscaler.autoscaling.openshift.io/worker-autoscaler created
oc apply -f clusautoscaler.yaml
clusterautoscaler.autoscaling.openshift.io/default created
Note: You can construct the MachineAutoscaler and ClusterAutoscaler manifests from Red Hat's official documentation.
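For reference, here is roughly what the two manifests can look like. The MachineSet name and namespace below come from this cluster; the replica bounds and everything else are a minimal sketch of the autoscaling.openshift.io resources, so adjust to taste:

```yaml
# machineautoscaler.yaml - pins scaling bounds to one MachineSet
apiVersion: autoscaling.openshift.io/v1beta1
kind: MachineAutoscaler
metadata:
  name: worker-autoscaler
  namespace: openshift-machine-api
spec:
  minReplicas: 1          # never scale below this
  maxReplicas: 3          # never scale above this
  scaleTargetRef:
    apiVersion: machine.openshift.io/v1beta1
    kind: MachineSet
    name: mycluster-7ln8n-worker-0
---
# clusautoscaler.yaml - the cluster-wide autoscaling policy
apiVersion: autoscaling.openshift.io/v1
kind: ClusterAutoscaler
metadata:
  name: default
spec:
  scaleDown:
    enabled: true         # also allow removing nodes, not just adding
```

The ClusterAutoscaler is a singleton named "default"; the MachineAutoscaler is per-MachineSet, so you would create one per pool you want scaled.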
Let's add some load to our cluster. This is just a simple manifest that requests 1G memory and spawns ~20+ replicas.
oc apply -f dep.yaml
deployment.apps/load-test created
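For completeness, a dep.yaml along these lines would do the trick. The image is a placeholder (any memory-hungry or idle workload works), though the container name "stress" matches what shows up in the event logs later:

```yaml
# dep.yaml - hypothetical load generator: each replica requests 1Gi,
# and 20 replicas are far more than one worker can accommodate
apiVersion: apps/v1
kind: Deployment
metadata:
  name: load-test
spec:
  replicas: 20
  selector:
    matchLabels:
      app: load-test
  template:
    metadata:
      labels:
        app: load-test
    spec:
      containers:
      - name: stress
        image: registry.example.com/stress:latest   # placeholder image
        resources:
          requests:
            memory: 1Gi    # the request is what drives scheduling pressure
```

Note that the scheduler (and therefore the autoscaler) reacts to resource requests, not actual usage, so the pods don't even need to consume the memory to trigger a scale-out.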
This manifest has created some load on the current worker node, which probably needs another one to accommodate all the pods. Hence a lot of them are now in "Pending" state.
load-test-65777f99f7-458kp   0/1   Pending   0     2m22s   <none>        <none>    <none>   <none>
load-test-65777f99f7-4p72l   1/1   Running   0     2m22s   10.128.2.43   worker1   <none>   <none>
load-test-65777f99f7-4v6hq   0/1   Pending   0     2m22s   <none>        <none>    <none>   <none>
load-test-65777f99f7-84n79   1/1   Running   0     2m22s   10.128.2.39   worker1   <none>   <none>
load-test-65777f99f7-8955k   0/1   Pending   0     2m22s   <none>        <none>    <none>   <none>
load-test-65777f99f7-9h4j6   0/1   Pending   0     2m22s   <none>        <none>    <none>   <none>
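When the list gets long, a quick pipe tells you how many pods are stuck. The snippet below runs the filter against a captured sample of the output above; on a live cluster you would feed it `oc get pods --no-headers` instead:

```shell
# A captured sample of the `oc get pods` output above
sample='load-test-65777f99f7-458kp   0/1   Pending   0   2m22s
load-test-65777f99f7-4p72l   1/1   Running   0   2m22s
load-test-65777f99f7-4v6hq   0/1   Pending   0   2m22s
load-test-65777f99f7-84n79   1/1   Running   0   2m22s'

# On a live cluster:  oc get pods --no-headers | grep -c Pending
printf '%s\n' "$sample" | grep -c Pending   # prints 2
```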
Our autoscalers have updated our machineset:
oc get machineset
NAME                       DESIRED   CURRENT   READY   AVAILABLE   AGE
mycluster-7ln8n-worker-0   2         2         1       1           8d
Which has started provisioning a new node for us:
oc get bmh
NAME      STATE                    CONSUMER                         ONLINE   ERROR   AGE
master    externally provisioned   mycluster-7ln8n-master-0         true             8d
master2   externally provisioned   mycluster-7ln8n-master-1         true             8d
master3   externally provisioned   mycluster-7ln8n-master-2         true             8d
worker2   provisioning             mycluster-7ln8n-worker-0-9vszs   true             12m
worker1   provisioned              mycluster-7ln8n-worker-0-qv4cn   true             3d10h
In no time (OK, around 20 minutes) you should see your new worker node ready to take workload. The pending pods will gradually move to this new worker.
oc get nodes
NAME      STATUS   ROLES                  AGE     VERSION
master    Ready    control-plane,master   8d      v1.27.10+28ed2d7
master2   Ready    control-plane,master   8d      v1.27.10+28ed2d7
master3   Ready    control-plane,master   8d      v1.27.10+28ed2d7
worker1   Ready    worker                 3d8h    v1.27.10+28ed2d7
worker2   Ready    worker                 3m14s   v1.27.10+28ed2d7
"oc get pods -o wide" will confirm that the workloads are moving to this new worker node:
load-test-65777f99f7-458kp   1/1   Running   0     33m   10.131.0.5    worker2   <none>   <none>
load-test-65777f99f7-4p72l   1/1   Running   0     33m   10.128.2.43   worker1   <none>   <none>
load-test-65777f99f7-84n79   1/1   Running   0     33m   10.128.2.39   worker1   <none>   <none>
load-test-65777f99f7-9h4j6   1/1   Running   0     33m   10.131.0.6    worker2   <none>   <none>
load-test-65777f99f7-ctvc7   1/1   Running   0     33m   10.128.2.41   worker1   <none>   <none>
The autoscalers not only work well for scaling out, they also work pretty well for scaling in.
To test this out I deleted the deployment I had created earlier:
oc delete deployment/load-test
deployment.apps "load-test" deleted
We can also check the event logs using 'oc get events'. It tells us the entire flow:
24m     Normal   Killing                  Pod/load-test-65777f99f7-ctvc7           Stopping container stress
24m     Normal   Killing                  Pod/load-test-65777f99f7-hgf2n           Stopping container stress
13m     Normal   ScaleDownEmpty           ConfigMap/cluster-autoscaler-status      Scale-down: removing empty node "worker2"
13m     Normal   ScaleDownEmpty           ConfigMap/cluster-autoscaler-status      Scale-down: empty node worker2 removed
13m     Normal   DrainProceeds            Machine/mycluster-7ln8n-worker-0-9vszs   Node drain proceeds
13m (x10 over 66m)   Normal   SuccessfulUpdate   MachineAutoscaler/worker-autoscaler   Updated MachineAutoscaler target: openshift-machine-api/mycluster-7ln8n-worker-0
13m     Normal   Deleted                  Machine/mycluster-7ln8n-worker-0-9vszs   Node "worker2" drained
13m     Normal   DrainSucceeded           Machine/mycluster-7ln8n-worker-0-9vszs   Node drain succeeded
13m     Normal   DeprovisioningStarted    BareMetalHost/worker2                    Image deprovisioning started
8m34s   Normal   DeprovisioningComplete   BareMetalHost/worker2                    Image deprovisioning completed
8m33s   Normal   PowerOff                 BareMetalHost/worker2                    Host soft powered off
In a nutshell it has:
- Killed the containers once the deployment was deleted.
- Scaled the MachineSet back down (we can verify this in the machineset output).
- Drained the node.
- Deprovisioned and powered off the node.
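How aggressively this scale-in happens is governed by the scaleDown stanza of the ClusterAutoscaler. The values below are illustrative, not necessarily what this cluster used, but the field names are the documented knobs:

```yaml
apiVersion: autoscaling.openshift.io/v1
kind: ClusterAutoscaler
metadata:
  name: default
spec:
  scaleDown:
    enabled: true                # allow the autoscaler to remove nodes
    delayAfterAdd: 10m           # cooldown after a scale-up before considering removals
    unneededTime: 5m             # how long a node must be unneeded before it is removed
    utilizationThreshold: "0.4"  # below this utilization a node counts as unneeded
```

Tightening unneededTime makes reclamation faster but risks thrashing on bursty workloads; on bare metal, where a reprovision takes ~20 minutes, erring on the slower side is usually the safer choice.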
oc get machineset -n openshift-machine-api
NAME                       DESIRED   CURRENT   READY   AVAILABLE   AGE
mycluster-7ln8n-worker-0   1         1         1       1           8d
virsh list --all
 Id   Name      State
--------------------------
 1    master    running
 2    master2   running
 3    master3   running
 4    worker1   running
 -    worker2   shut off