In my previous post, I covered how to scale compute (worker) nodes in OpenShift using a semi-automated approach. While OpenShift handled most of the heavy lifting, such as powering on the node via BMC, installing RHCOS, and joining the cluster, the scaling action itself still required manual intervention.
This document focuses on removing that manual step altogether. Specifically, it explores how Cluster Autoscaler and Machine Autoscaler work together to enable automatic, workload-driven scaling of compute nodes. While technologies like Cluster API and Machine API provide the underlying framework for managing machine lifecycles, the real decision-making around when to scale happens at the autoscaler layer.
Before I dive in, I would like to spend a line on each of the building blocks (so you don't think of me as a complete idiot!).
Cluster Autoscaler - observes the state of workloads in the cluster. It continuously monitors pending and unschedulable pods and determines whether adding (or removing) nodes would help satisfy resource requirements.
Machine Autoscaler - acts as the bridge between high-level scaling decisions and infrastructure changes.
MachineSets - the equivalent of a ReplicaSet, but for nodes instead of pods; they define how a worker node should be created.
Cluster API - a Kubernetes project that provides a declarative way to create, manage, and scale Kubernetes clusters.
Machine API - Red Hat's opinionated implementation of Cluster API concepts.
The way these different components work together is:
- The Cluster Autoscaler watches for unschedulable pods.
- It informs the Machine Autoscaler, which updates the MachineSet limits. Simply speaking, it runs the equivalent of "oc scale machineset/<name> --replicas=2".
- Our hardworking employee, the Machine API, prepares the new node (of course, with help from the BMH).
- Pods get scheduled on the new worker node.
We will now walk through this teeny-tiny set of 4 steps in OpenShift.
Note: You can switch to the "openshift-machine-api" namespace, as most of the steps happen there:
oc project openshift-machine-api
Let's check our current nodes, machinesets and BMH (that's BareMetalHosts and not Jasprit Bumrah!)
oc get nodes
NAME      STATUS   ROLES                  AGE    VERSION
master    Ready    control-plane,master   8d     v1.27.10+28ed2d7
master2   Ready    control-plane,master   8d     v1.27.10+28ed2d7
master3   Ready    control-plane,master   8d     v1.27.10+28ed2d7
worker1   Ready    worker                 3d7h   v1.27.10+28ed2d7
oc get machinesets
NAME                       DESIRED   CURRENT   READY   AVAILABLE   AGE
mycluster-7ln8n-worker-0   1         1         1       1           8d
oc get bmh
NAME      STATE                    CONSUMER                         ONLINE   ERROR   AGE
master    externally provisioned   mycluster-7ln8n-master-0         true             8d
master2   externally provisioned   mycluster-7ln8n-master-1         true             8d
master3   externally provisioned   mycluster-7ln8n-master-2         true             8d
worker1   provisioned              mycluster-7ln8n-worker-0-qv4cn   true             3d10h
Let's create our ClusterAutoscaler and MachineAutoscaler manifests.
oc apply -f machineautoscaler.yaml
machineautoscaler.autoscaling.openshift.io/worker-autoscaler created
oc apply -f clusautoscaler.yaml
clusterautoscaler.autoscaling.openshift.io/default created
Note: You can construct the MachineAutoscaler and ClusterAutoscaler manifests from Red Hat's official documentation.
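For reference, here is roughly what the two manifests can look like. The MachineSet name and namespace below come from this cluster; the replica bounds and everything else are a minimal sketch of the autoscaling.openshift.io resources, so adjust to taste:

```yaml
# machineautoscaler.yaml - pins scaling bounds to one MachineSet
apiVersion: autoscaling.openshift.io/v1beta1
kind: MachineAutoscaler
metadata:
  name: worker-autoscaler
  namespace: openshift-machine-api
spec:
  minReplicas: 1          # never scale below this
  maxReplicas: 3          # never scale above this
  scaleTargetRef:
    apiVersion: machine.openshift.io/v1beta1
    kind: MachineSet
    name: mycluster-7ln8n-worker-0
---
# clusautoscaler.yaml - the cluster-wide autoscaling policy
apiVersion: autoscaling.openshift.io/v1
kind: ClusterAutoscaler
metadata:
  name: default
spec:
  scaleDown:
    enabled: true         # also allow removing nodes, not just adding
```

The ClusterAutoscaler is a singleton named "default"; the MachineAutoscaler is per-MachineSet, so you would create one per pool you want scaled.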
Let's add some load to our cluster. This is just a simple manifest that requests 1G memory and spawns ~20+ replicas.
oc apply -f dep.yaml
deployment.apps/load-test created
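For completeness, a dep.yaml along these lines would do the trick. The image is a placeholder (any memory-hungry or idle workload works), though the container name "stress" matches what shows up in the event logs later:

```yaml
# dep.yaml - hypothetical load generator: each replica requests 1Gi,
# and 20 replicas are far more than one worker can accommodate
apiVersion: apps/v1
kind: Deployment
metadata:
  name: load-test
spec:
  replicas: 20
  selector:
    matchLabels:
      app: load-test
  template:
    metadata:
      labels:
        app: load-test
    spec:
      containers:
      - name: stress
        image: registry.example.com/stress:latest   # placeholder image
        resources:
          requests:
            memory: 1Gi    # the request is what drives scheduling pressure
```

Note that the scheduler (and therefore the autoscaler) reacts to resource requests, not actual usage, so the pods don't even need to consume the memory to trigger a scale-out.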
This manifest has created some load on the current worker node, which probably needs another one to accommodate all the pods. Hence a lot of them are now in "Pending" state.
load-test-65777f99f7-458kp   0/1   Pending   0     2m22s   <none>        <none>    <none>   <none>
load-test-65777f99f7-4p72l   1/1   Running   0     2m22s   10.128.2.43   worker1   <none>   <none>
load-test-65777f99f7-4v6hq   0/1   Pending   0     2m22s   <none>        <none>    <none>   <none>
load-test-65777f99f7-84n79   1/1   Running   0     2m22s   10.128.2.39   worker1   <none>   <none>
load-test-65777f99f7-8955k   0/1   Pending   0     2m22s   <none>        <none>    <none>   <none>
load-test-65777f99f7-9h4j6   0/1   Pending   0     2m22s   <none>        <none>    <none>   <none>
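When the list gets long, a quick pipe tells you how many pods are stuck. The snippet below runs the filter against a captured sample of the output above; on a live cluster you would feed it `oc get pods --no-headers` instead:

```shell
# A captured sample of the `oc get pods` output above
sample='load-test-65777f99f7-458kp   0/1   Pending   0   2m22s
load-test-65777f99f7-4p72l   1/1   Running   0   2m22s
load-test-65777f99f7-4v6hq   0/1   Pending   0   2m22s
load-test-65777f99f7-84n79   1/1   Running   0   2m22s'

# On a live cluster:  oc get pods --no-headers | grep -c Pending
printf '%s\n' "$sample" | grep -c Pending   # prints 2
```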
Our autoscalers have updated our machineset:
oc get machineset
NAME                       DESIRED   CURRENT   READY   AVAILABLE   AGE
mycluster-7ln8n-worker-0   2         2         1       1           8d
Which has started provisioning a new node for us:
oc get bmh
NAME      STATE                    CONSUMER                         ONLINE   ERROR   AGE
master    externally provisioned   mycluster-7ln8n-master-0         true             8d
master2   externally provisioned   mycluster-7ln8n-master-1         true             8d
master3   externally provisioned   mycluster-7ln8n-master-2         true             8d
worker2   provisioning             mycluster-7ln8n-worker-0-9vszs   true             12m
worker1   provisioned              mycluster-7ln8n-worker-0-qv4cn   true             3d10h
In no time (OK, around 20 minutes) you should see your new worker node ready to take workload. The pending pods will gradually move to this new worker.
oc get nodes
NAME      STATUS   ROLES                  AGE     VERSION
master    Ready    control-plane,master   8d      v1.27.10+28ed2d7
master2   Ready    control-plane,master   8d      v1.27.10+28ed2d7
master3   Ready    control-plane,master   8d      v1.27.10+28ed2d7
worker1   Ready    worker                 3d8h    v1.27.10+28ed2d7
worker2   Ready    worker                 3m14s   v1.27.10+28ed2d7
"oc get pods -o wide" will confirm that the workloads are moving to this new worker node:
load-test-65777f99f7-458kp   1/1   Running   0     33m   10.131.0.5    worker2   <none>   <none>
load-test-65777f99f7-4p72l   1/1   Running   0     33m   10.128.2.43   worker1   <none>   <none>
load-test-65777f99f7-84n79   1/1   Running   0     33m   10.128.2.39   worker1   <none>   <none>
load-test-65777f99f7-9h4j6   1/1   Running   0     33m   10.131.0.6    worker2   <none>   <none>
load-test-65777f99f7-ctvc7   1/1   Running   0     33m   10.128.2.41   worker1   <none>   <none>
The autoscalers not only work well for scaling out, they also work pretty well for scaling in.
To test this out I deleted the deployment I had created earlier:
oc delete deployment/load-test
deployment.apps "load-test" deleted
We can also check the event logs using 'oc get events'. It tells us the entire flow:
24m     Normal   Killing                  Pod/load-test-65777f99f7-ctvc7           Stopping container stress
24m     Normal   Killing                  Pod/load-test-65777f99f7-hgf2n           Stopping container stress
13m     Normal   ScaleDownEmpty           ConfigMap/cluster-autoscaler-status      Scale-down: removing empty node "worker2"
13m     Normal   ScaleDownEmpty           ConfigMap/cluster-autoscaler-status      Scale-down: empty node worker2 removed
13m     Normal   DrainProceeds            Machine/mycluster-7ln8n-worker-0-9vszs   Node drain proceeds
13m (x10 over 66m)   Normal   SuccessfulUpdate   MachineAutoscaler/worker-autoscaler   Updated MachineAutoscaler target: openshift-machine-api/mycluster-7ln8n-worker-0
13m     Normal   Deleted                  Machine/mycluster-7ln8n-worker-0-9vszs   Node "worker2" drained
13m     Normal   DrainSucceeded           Machine/mycluster-7ln8n-worker-0-9vszs   Node drain succeeded
13m     Normal   DeprovisioningStarted    BareMetalHost/worker2                    Image deprovisioning started
8m34s   Normal   DeprovisioningComplete   BareMetalHost/worker2                    Image deprovisioning completed
8m33s   Normal   PowerOff                 BareMetalHost/worker2                    Host soft powered off
In a nutshell it has:
- Killed the containers once the deployment was deleted.
- Scaled the MachineSet back down (we can verify this in the machineset output).
- Drained the node.
- Deprovisioned and powered off the node.
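How aggressively this scale-in happens is governed by the scaleDown stanza of the ClusterAutoscaler. The values below are illustrative, not necessarily what this cluster used, but the field names are the documented knobs:

```yaml
apiVersion: autoscaling.openshift.io/v1
kind: ClusterAutoscaler
metadata:
  name: default
spec:
  scaleDown:
    enabled: true                # allow the autoscaler to remove nodes
    delayAfterAdd: 10m           # cooldown after a scale-up before considering removals
    unneededTime: 5m             # how long a node must be unneeded before it is removed
    utilizationThreshold: "0.4"  # below this utilization a node counts as unneeded
```

Tightening unneededTime makes reclamation faster but risks thrashing on bursty workloads; on bare metal, where a reprovision takes ~20 minutes, erring on the slower side is usually the safer choice.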
oc get machineset -n openshift-machine-api
NAME                       DESIRED   CURRENT   READY   AVAILABLE   AGE
mycluster-7ln8n-worker-0   1         1         1       1           8d
virsh list --all
 Id   Name      State
--------------------------
 1    master    running
 2    master2   running
 3    master3   running
 4    worker1   running
 -    worker2   shut off