
Autoscaling Nodes in Kubernetes

Continuing with our journey of Horizontal Scaling in Kubernetes, here is another blog, which focuses on autoscaling of Nodes in a Kubernetes cluster. This capability is commonly referred to as the Cluster Autoscaler (CA).


Please refer to this blog to get more context on how we managed Horizontal Pod Scaling in Kubernetes - Horizontal Pod Scaling in Kubernetes

Though the concept is applicable across all major cloud providers, I will be using AWS as the cloud provider, with the managed AWS EKS service as the Kubernetes platform for running the sample code.

To set the context once again — Assume as a Platform Team, you are running a Kubernetes cluster in a cloud environment, and the Application Teams hosting their services have asked you to ensure that their workload pods automatically scale out as traffic spikes. You make them aware of the Horizontal Pod Scaling construct and the team implements it. Everyone is happy, but very soon you run into a situation where, while scaling out, workload pods get stuck in the Pending state because the Kubernetes cluster does not have enough resources available to schedule them.

To handle this situation, you need a mechanism that automatically adds more nodes to the Kubernetes cluster, which helps ensure that pods don’t stay stuck in the Pending state. And similarly, when the pods scale in because of low traffic, the extra nodes should also get scaled in (think terminated).

Implementing this is conceptually straightforward, but honestly, there are quite a few moving parts that need to be managed to get it done the right way.

Let’s see what it would take to implement this with the AWS EKS service as our Kubernetes environment —

Logical Flow

  1. We need a component (let’s call it the Cluster Autoscaler from here on) which can detect when Kubernetes Pods are going into the Pending state because of a resource crunch (see the quick check right after this list). This component needs Kubernetes RBAC access to monitor the state of the pods.
  2. Once the Cluster Autoscaler realizes that pods are Pending because of a resource shortage, it launches a new node (EC2 instance), which then joins the Kubernetes cluster. Note — to launch a new node, this component needs AWS permissions to perform the action. Will talk more on this in some time. Hold on to your thoughts until then !!!
  3. The Kubernetes Scheduler then starts scheduling the pending pods onto the newly available node(s).
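You can see this condition for yourself with plain kubectl. A Pod that cannot be scheduled stays in Pending with a FailedScheduling event (the pod name below is a placeholder and the event text is illustrative):

# List Pending pods across all namespaces
kubectl get pods --all-namespaces --field-selector=status.phase=Pending

# Inspect why a specific pod is Pending
kubectl describe pod <pod-name>

# Typical event when the cluster is out of capacity:
#   Warning  FailedScheduling  0/2 nodes are available: 2 Insufficient cpu.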

Implementation —

We will be using AWS EKS managed service for hosting our Kubernetes cluster and eksctl tool for creating the cluster.

Step 0 — Install the tools needed to create the Kubernetes infrastructure. The commands below have been tested on Linux.

# Install eksctl tool
curl --silent --location "https://github.com/weaveworks/eksctl/releases/latest/download/eksctl_$(uname -s)_amd64.tar.gz" | tar xz -C /tmp
sudo mv /tmp/eksctl /usr/local/bin

# Install kubectl tool
curl -O https://s3.us-west-2.amazonaws.com/amazon-eks/1.23.13/2022-10-31/bin/linux/amd64/kubectl
chmod +x ./kubectl
sudo mv ./kubectl /usr/local/bin/kubectl

# Install or update AWS CLI
curl "https://awscli.amazonaws.com/awscli-exe-linux-x86_64.zip" -o "awscliv2.zip"
unzip awscliv2.zip
sudo ./aws/install
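A quick sanity check that the tools are installed and on the PATH:

# Verify the installed tools
eksctl version
kubectl version --client
aws --version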

Step 1 — Create the EKS cluster using the eksctl tool and deploy the Kubernetes Metrics Server inside it.

Clone this public repository from GitHub — https://github.com/waswani/kubernetes-ca, navigate to the folder kubernetes-ca and execute the commands as mentioned below —

# Create EKS Cluster with version 1.23
eksctl create cluster -f eks-cluster.yaml

# Output like below shows cluster has been successfully created
2022-12-30 16:26:46 [ℹ]  kubectl command should work with "/home/ec2-user/.kube/config", try 'kubectl get nodes'
2022-12-30 16:26:46 [✔]  EKS cluster "ca-demo" in "us-west-2" region is ready

# Deploy the Metric server
kubectl apply -f https://github.com/kubernetes-sigs/metrics-server/releases/latest/download/components.yaml

# Output of the above command looks something like below - 
serviceaccount/metrics-server created
clusterrole.rbac.authorization.k8s.io/system:aggregated-metrics-reader created
clusterrole.rbac.authorization.k8s.io/system:metrics-server created
rolebinding.rbac.authorization.k8s.io/metrics-server-auth-reader created
clusterrolebinding.rbac.authorization.k8s.io/metrics-server:system:auth-delegator created
clusterrolebinding.rbac.authorization.k8s.io/system:metrics-server created
service/metrics-server created
deployment.apps/metrics-server created
apiservice.apiregistration.k8s.io/v1beta1.metrics.k8s.io created
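For reference, the cluster definition that eksctl consumes follows its ClusterConfig schema. A minimal sketch along these lines would produce a cluster like ca-demo with one node group; the actual eks-cluster.yaml in the repo may differ in instance type and sizes:

# Illustrative eks-cluster.yaml (the repo's version is authoritative)
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: ca-demo
  region: us-west-2
  version: "1.23"

nodeGroups:
  - name: ng-1
    instanceType: m5.large      # illustrative instance type
    minSize: 2
    maxSize: 5
    desiredCapacity: 2
    tags:
      k8s.io/cluster-autoscaler/enabled: "true"
      k8s.io/cluster-autoscaler/ca-demo: "owned"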

Step 2 — Create an IAM role which can be assumed by the Cluster Autoscaler. This role lets it launch additional nodes or terminate existing ones as traffic for the deployed workload goes up and down.

Let’s go a bit behind the scenes of this step —

  1. When you launch an AWS EKS cluster using the eksctl tool, an AWS Auto Scaling Group (referred to as ASG from here on) gets created for each node group defined in the configuration file. We have just one nodeGroup in the config file, but you can define multiple. Wondering why one would need more than one nodeGroup? My next blog is going to cover exactly that. Please wait until then.
  2. The ASG is created with the min, max and desired capacity attributes configured as per the EKS configuration file.
  3. The Cluster Autoscaler needs AWS privileges to change the desired capacity of the ASG in order to scale the nodes in or out.
  4. There are two possible ways to assign these AWS privileges to the Cluster Autoscaler component.
  5. Option 1 — As our Cluster Autoscaler is going to run as a pod inside the EKS cluster, we could assign the required privileges to the IAM role assumed by the EKS cluster nodes. The Cluster Autoscaler would then implicitly use that role to perform its job. This works, but it also means that any pod running inside the EKS cluster can use those privileges. And thus, it violates the principle of Least Privilege.
  6. Option 2 — Assign just enough privileges directly to the Cluster Autoscaler pods in EKS. This concept is called IAM Roles for Service Accounts (referred to as IRSA from here on).
  7. We will use the 2nd approach to make the Cluster Autoscaler work. It leverages the concept of a Kubernetes Service Account and maps this Service Account to an IAM role that has the AWS privileges to manipulate the ASG. The Service Account is made available to the Pod via the Pod resource configuration.

When the Cluster Autoscaler needs to add or remove nodes, it updates the ASG’s desired capacity by calling the AWS Auto Scaling API. Behind the scenes, the AWS SDK obtains temporary AWS credentials by calling AssumeRoleWithWebIdentity, passing the JWT token which was issued to the Cluster Autoscaler pod during its creation.

For more details on IRSA, please check this blog — AWS EKS and the Principle of Least Privilege
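You don’t have to build this trust relationship by hand, since eksctl sets it up in Step 2 below. For context, though, the trust policy attached to the IAM role looks roughly like this (account ID and OIDC provider ID are placeholders):

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Effect": "Allow",
            "Principal": {
                "Federated": "arn:aws:iam::<ACCOUNT_ID>:oidc-provider/oidc.eks.us-west-2.amazonaws.com/id/<OIDC_ID>"
            },
            "Action": "sts:AssumeRoleWithWebIdentity",
            "Condition": {
                "StringEquals": {
                    "oidc.eks.us-west-2.amazonaws.com/id/<OIDC_ID>:sub": "system:serviceaccount:kube-system:cluster-autoscaler-sa"
                }
            }
        }
    ]
}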

Here is the AWS policy needed for manipulating ASG —

{
    "Version": "2012-10-17",
    "Statement": [
        {
            "Action": [
                "autoscaling:DescribeAutoScalingGroups",
                "autoscaling:DescribeAutoScalingInstances",
                "autoscaling:DescribeLaunchConfigurations",
                "autoscaling:DescribeTags",
                "autoscaling:SetDesiredCapacity",
                "autoscaling:TerminateInstanceInAutoScalingGroup",
                "ec2:DescribeInstanceTypes",
                "ec2:DescribeLaunchTemplateVersions"
            ],
            "Resource": "*",
            "Effect": "Allow"
        }
    ]
}

In the code snippet below, we create the AWS IAM policy using the JSON mentioned above, and then the AWS IAM role and Kubernetes Service Account using the eksctl create iamserviceaccount command.

# Create IAM Policy
aws iam create-policy --policy-name ca-asg-policy --policy-document file://ca-asg-policy.json 

# Export the AWS Account ID to be used in the next command
export ACCOUNT_ID=$(aws sts get-caller-identity --output text --query Account)

# Create Kubernetes Service Account and IAM Role account using eksctl tool
eksctl create iamserviceaccount --name cluster-autoscaler-sa \
--namespace kube-system --cluster ca-demo \
--attach-policy-arn "arn:aws:iam::${ACCOUNT_ID}:policy/ca-asg-policy" \
--approve --override-existing-serviceaccounts

# To verify the Service account got created successfully, execute below command
kubectl describe serviceaccount cluster-autoscaler-sa -n kube-system

# And the output looks something like this
Name:                cluster-autoscaler-sa
Namespace:           kube-system
Labels:              app.kubernetes.io/managed-by=eksctl
Annotations:         eks.amazonaws.com/role-arn: arn:aws:iam::108756121336:role/eksctl-ca-demo-addon-iamserviceaccount-kube-Role1-22KGUNWA2E22
Image pull secrets:  <none>
Mountable secrets:   cluster-autoscaler-sa-token-gmmfz
Tokens:              cluster-autoscaler-sa-token-gmmfz
Events:              <none>

The output of kubectl describe serviceaccount shows an important construct — Annotations. The mapping between the AWS IAM role and the Service Account is done through this annotation.
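If you were wiring this up by hand instead of through eksctl, the Service Account would carry just that one annotation (the role ARN below is a placeholder):

apiVersion: v1
kind: ServiceAccount
metadata:
  name: cluster-autoscaler-sa
  namespace: kube-system
  annotations:
    eks.amazonaws.com/role-arn: arn:aws:iam::<ACCOUNT_ID>:role/<CLUSTER_AUTOSCALER_ROLE>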

Step 3 — Deploy the Cluster Autoscaler component which is going to perform the magic for us.

# Deploy Cluster Autoscaler using below command
kubectl apply -f cluster-autoscaler-aws.yaml 

# The output shows all the Kubernetes native objects which get
# created as part of this deployment
clusterrole.rbac.authorization.k8s.io/cluster-autoscaler created
role.rbac.authorization.k8s.io/cluster-autoscaler created
clusterrolebinding.rbac.authorization.k8s.io/cluster-autoscaler created
rolebinding.rbac.authorization.k8s.io/cluster-autoscaler created
deployment.apps/cluster-autoscaler created

Key points to remember —

  1. The Cluster Autoscaler Docker image to be used depends on the EKS Kubernetes version. For us it’s 1.23, hence we used — k8s.gcr.io/autoscaling/cluster-autoscaler:v1.23.0
  2. The Service Account created in Step 2 is passed as serviceAccountName in the Cluster Autoscaler pod template.
  3. The ASG created as part of the EKS cluster node group is configured with the 2 standard tags mentioned below.
k8s.io/cluster-autoscaler/ca-demo = owned
k8s.io/cluster-autoscaler/enabled = true

These tags help the Cluster Autoscaler discover the ASG(s) it can use for scaling. An ASG that does not have these tags attached will not be discovered by the Cluster Autoscaler.
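To see how the pieces above fit together, here is a stripped-down sketch of the Cluster Autoscaler Deployment. The cluster-autoscaler-aws.yaml in the repo should contain something along these lines (plus the RBAC objects shown in the output above), though the exact flags and labels may differ:

# Illustrative sketch of the Cluster Autoscaler Deployment
apiVersion: apps/v1
kind: Deployment
metadata:
  name: cluster-autoscaler
  namespace: kube-system
  labels:
    app: cluster-autoscaler
spec:
  replicas: 1
  selector:
    matchLabels:
      app: cluster-autoscaler
  template:
    metadata:
      labels:
        app: cluster-autoscaler
    spec:
      serviceAccountName: cluster-autoscaler-sa
      containers:
        - name: cluster-autoscaler
          image: k8s.gcr.io/autoscaling/cluster-autoscaler:v1.23.0
          command:
            - ./cluster-autoscaler
            - --cloud-provider=aws
            - --node-group-auto-discovery=asg:tag=k8s.io/cluster-autoscaler/enabled,k8s.io/cluster-autoscaler/ca-demo
            - --balance-similar-node-groups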

Step 4 — Let’s deploy the workload which will drive cluster scaling as its pods start going into the Pending state.

# Deploy the sample application
kubectl apply -f apache-php-deployment.yaml 

# Expose the deployed application via Service, listening on port 80
kubectl apply -f apache-php-service.yaml

# Create HPA
kubectl apply -f apache-php-hpa.yaml 

We are using the HPA concept from our previous blog — Horizontal Pod Scaling. Do take a look at that blog if anything around it is unclear.
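For completeness, an HPA matching the numbers you will see below (a 20% average CPU target, 1 to 10 replicas) would look roughly like this; the actual apache-php-hpa.yaml in the repo may differ:

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: apache-php
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: apache-php
  minReplicas: 1
  maxReplicas: 10
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 20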

Before you execute the next step, open 3 different terminals and fire the below commands —

Terminal 1 — Get current nodes running in the cluster

kubectl get nodes -w

# The initial output would show you 2 nodes
NAME                                           STATUS   ROLES    AGE   VERSION
ip-192-168-43-152.us-west-2.compute.internal   Ready    <none>   20h   v1.23.13-eks-fb459a0
ip-192-168-88-36.us-west-2.compute.internal    Ready    <none>   20h   v1.23.13-eks-fb459a0

Terminal 2 — Get current pods running in the default namespace

kubectl get pods -w

# The output should show just a single pod running
NAME                          READY   STATUS    RESTARTS   AGE
apache-php-65f9f58f4f-zrnnf   1/1     Running   0          6m45s

Terminal 3 — Get the Horizontal Pod Autoscaler resource configured in the default namespace.

kubectl get hpa -w

# You should see something like this to start with
NAME         REFERENCE               TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
apache-php   Deployment/apache-php   0%/20%    1         10        1          37m

Step 5 — And now let’s increase the load on the service. This will first trigger Horizontal Pod scaling; then, as Pods go into the Pending state, the Cluster Autoscaler comes into action, adds more nodes to the cluster, and the pods move to the Running state.

# Launch a busybox pod and fire requests at the workload service in an infinite loop
kubectl run -i --tty load-gen --image=busybox -- /bin/sh

# And from the busybox shell, fire the below command
while true; do wget -q -O - http://apache-php; done

# And you should see output like this
OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!
OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!
OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!
OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!
OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!
OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!OK!
...

Here are the observations —

  1. As the load increases, HPA launches more Pods to keep the average CPU utilization at the 20% target.
  2. As HPA tries to add more pods, the EKS cluster runs out of CPU resources and Pods go into the Pending state.
  3. Pods going into the Pending state trigger the Cluster Autoscaler to add more Nodes. And once the new nodes join the EKS cluster and reach the Ready state, the Pods which were Pending move to the Running state.
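If you want to watch the Cluster Autoscaler making these decisions, its logs are the place to look. The label selector below assumes the Deployment uses the common app=cluster-autoscaler label; adjust it to match the manifest if needed:

# Tail the Cluster Autoscaler logs
kubectl -n kube-system logs -l app=cluster-autoscaler --tail=50 -f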

Look at the output of each Terminal once again —

Terminal 1 — Additional node got launched as part of Cluster Autoscaling.

kubectl get nodes -w

NAME                                           STATUS   ROLES    AGE   VERSION
ip-192-168-43-152.us-west-2.compute.internal   Ready    <none>   20h   v1.23.13-eks-fb459a0
ip-192-168-88-36.us-west-2.compute.internal    Ready    <none>   20h   v1.23.13-eks-fb459a0
ip-192-168-28-98.us-west-2.compute.internal    NotReady   <none>   0s    v1.23.13-eks-fb459a0
ip-192-168-28-98.us-west-2.compute.internal    NotReady   <none>   0s    v1.23.13-eks-fb459a0
ip-192-168-28-98.us-west-2.compute.internal    NotReady   <none>   0s    v1.23.13-eks-fb459a0
ip-192-168-28-98.us-west-2.compute.internal    NotReady   <none>   0s    v1.23.13-eks-fb459a0
ip-192-168-28-98.us-west-2.compute.internal    NotReady   <none>   2s    v1.23.13-eks-fb459a0
ip-192-168-28-98.us-west-2.compute.internal    NotReady   <none>   10s   v1.23.13-eks-fb459a0
ip-192-168-28-98.us-west-2.compute.internal    NotReady   <none>   18s   v1.23.13-eks-fb459a0
ip-192-168-28-98.us-west-2.compute.internal    Ready      <none>   31s   v1.23.13-eks-fb459a0
ip-192-168-28-98.us-west-2.compute.internal    Ready      <none>   31s   v1.23.13-eks-fb459a0
ip-192-168-28-98.us-west-2.compute.internal    Ready      <none>   32s   v1.23.13-eks-fb459a0
ip-192-168-28-98.us-west-2.compute.internal    Ready      <none>   62s   v1.23.13-eks-fb459a0

Terminal 2 — Get status of pods in the cluster.

kubectl get pods -w

NAME                          READY   STATUS    RESTARTS   AGE
apache-php-65f9f58f4f-zrnnf   1/1     Running   0          8m47s
load-gen                      1/1     Running   0          13s
apache-php-65f9f58f4f-q8qsg   0/1     Pending   0          0s
apache-php-65f9f58f4f-q8qsg   0/1     Pending   0          0s
apache-php-65f9f58f4f-mz4w4   0/1     Pending   0          0s
apache-php-65f9f58f4f-6j7mg   0/1     Pending   0          0s
apache-php-65f9f58f4f-mz4w4   0/1     Pending   0          0s
apache-php-65f9f58f4f-6j7mg   0/1     Pending   0          0s
apache-php-65f9f58f4f-q8qsg   0/1     ContainerCreating   0          0s
apache-php-65f9f58f4f-mz4w4   0/1     ContainerCreating   0          0s
apache-php-65f9f58f4f-6j7mg   0/1     ContainerCreating   0          0s
apache-php-65f9f58f4f-mz4w4   1/1     Running             0          2s
apache-php-65f9f58f4f-6j7mg   1/1     Running             0          2s
apache-php-65f9f58f4f-q8qsg   1/1     Running             0          2s
apache-php-65f9f58f4f-h8z2p   0/1     Pending             0          0s
apache-php-65f9f58f4f-h8z2p   0/1     Pending             0          0s
apache-php-65f9f58f4f-g45qg   0/1     Pending             0          0s
apache-php-65f9f58f4f-t6b6b   0/1     Pending             0          0s
apache-php-65f9f58f4f-g45qg   0/1     Pending             0          0s
apache-php-65f9f58f4f-t6b6b   0/1     Pending             0          0s
apache-php-65f9f58f4f-ncll2   0/1     Pending             0          0s
apache-php-65f9f58f4f-ncll2   0/1     Pending             0          0s
apache-php-65f9f58f4f-h8z2p   0/1     ContainerCreating   0          0s
apache-php-65f9f58f4f-g45qg   0/1     ContainerCreating   0          0s
apache-php-65f9f58f4f-h8z2p   1/1     Running             0          2s
apache-php-65f9f58f4f-g45qg   1/1     Running             0          2s
apache-php-65f9f58f4f-hctnf   0/1     Pending             0          0s
apache-php-65f9f58f4f-hctnf   0/1     Pending             0          0s
apache-php-65f9f58f4f-ncll2   0/1     Pending             0          75s
apache-php-65f9f58f4f-t6b6b   0/1     Pending             0          75s
apache-php-65f9f58f4f-hctnf   0/1     Pending             0          30s
apache-php-65f9f58f4f-ncll2   0/1     ContainerCreating   0          75s
apache-php-65f9f58f4f-t6b6b   0/1     ContainerCreating   0          75s
apache-php-65f9f58f4f-hctnf   0/1     ContainerCreating   0          30s
apache-php-65f9f58f4f-ncll2   1/1     Running             0          91s
apache-php-65f9f58f4f-t6b6b   1/1     Running             0          91s
apache-php-65f9f58f4f-hctnf   1/1     Running             0          47s

Terminal 3 — Additional pods were launched as part of Horizontal Pod scaling.

kubectl get hpa -w

NAME         REFERENCE               TARGETS   MINPODS   MAXPODS   REPLICAS   AGE
apache-php   Deployment/apache-php   0%/20%    1         10        1          37m
apache-php   Deployment/apache-php   149%/20%   1         10        1          38m
apache-php   Deployment/apache-php   149%/20%   1         10        4          38m
apache-php   Deployment/apache-php   111%/20%   1         10        8          38m
apache-php   Deployment/apache-php   23%/20%    1         10        8          38m
apache-php   Deployment/apache-php   30%/20%    1         10        8          39m
apache-php   Deployment/apache-php   27%/20%    1         10        9          39m
apache-php   Deployment/apache-php   29%/20%    1         10        9          39m
apache-php   Deployment/apache-php   28%/20%    1         10        9          39m
apache-php   Deployment/apache-php   26%/20%    1         10        9          40m
apache-php   Deployment/apache-php   19%/20%    1         10        9          40m

Stop the busybox container that is bombarding the service and wait for a few minutes. You will see the additional node(s) get cordoned (SchedulingDisabled) and then terminated.

ip-192-168-28-98.us-west-2.compute.internal    Ready      <none>   17m     v1.23.13-eks-fb459a0
ip-192-168-28-98.us-west-2.compute.internal    Ready,SchedulingDisabled   <none>   17m     v1.23.13-eks-fb459a0
ip-192-168-28-98.us-west-2.compute.internal    Ready,SchedulingDisabled   <none>   17m     v1.23.13-eks-fb459a0
ip-192-168-28-98.us-west-2.compute.internal    NotReady,SchedulingDisabled   <none>   18m     v1.23.13-eks-fb459a0
ip-192-168-28-98.us-west-2.compute.internal    NotReady,SchedulingDisabled   <none>   18m     v1.23.13-eks-fb459a0
ip-192-168-28-98.us-west-2.compute.internal    NotReady,SchedulingDisabled   <none>   18m     v1.23.13-eks-fb459a0
ip-192-168-88-36.us-west-2.compute.internal    Ready                         <none>   21h     v1.23.13-eks-fb459a0
ip-192-168-43-152.us-west-2.compute.internal   Ready                         <none>   21h     v1.23.13-eks-fb459a0
ip-192-168-88-36.us-west-2.compute.internal    Ready                         <none>   21h     v1.23.13-eks-fb459a0
ip-192-168-43-152.us-west-2.compute.internal   Ready                         <none>   21h     v1.23.13-eks-fb459a0
ip-192-168-88-36.us-west-2.compute.internal    Ready                         <none>   21h     v1.23.13-eks-fb459a0
ip-192-168-43-152.us-west-2.compute.internal   Ready                         <none>   21h     v1.23.13-eks-fb459a0
ip-192-168-88-36.us-west-2.compute.internal    Ready                         <none>   21h     v1.23.13-eks-fb459a0
ip-192-168-43-152.us-west-2.compute.internal   Ready                         <none>   21h     v1.23.13-eks-fb459a0
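You can also watch the scale-in from the AWS side by checking the desired capacity of the ASG. The command below lists every ASG in the region, so pick the one belonging to the ca-demo node group:

# Check ASG sizes from the AWS CLI
aws autoscaling describe-auto-scaling-groups \
  --query "AutoScalingGroups[].{Name:AutoScalingGroupName,Min:MinSize,Max:MaxSize,Desired:DesiredCapacity}" \
  --output table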

Step 6 — One of the most important steps: delete the Kubernetes cluster.

# Delete EKS Cluster 
eksctl delete cluster -f eks-cluster.yaml

With this blog, we covered Horizontal Scaling of Nodes in an EKS cluster. The next blogs will cover —

  1. Scaling Pods based on Custom Metrics
  2. Getting an alert on Slack if Pods go into the Pending state
  3. Using Kubernetes to host services with different resource requirements. This will cover the point I mentioned above — why you would need more than one nodeGroup in the cluster configuration file.

Hope you enjoyed reading this blog. Do share this blog with your friends if this has helped you in any way.

Happy Blogging…Cheers!!!
